Preface



These are the conference proceedings of the 4th International Conference on 
Discovery Science (DS 2001). Although discovery is naturally ubiquitous in sci- 
ence, and scientific discovery itself has been subject to scientific investigation for 
centuries, the term Discovery Science is comparatively new. It came up in connection
with the Japanese Discovery Science project (cf. Arikawa’s invited lecture
on The Discovery Science Project in Japan in the present volume) some time 
during the last few years. 

Setsuo Arikawa is the father in spirit of the Discovery Science conference 
series. He led the above mentioned project, and he is currently serving as the 
chairman of the international steering committee for the Discovery Science con- 
ference series. The other members of this board are currently (in alphabetical 
order) Klaus P. Jantke, Masahiko Sato, Ayumi Shinohara, Carl H. Smith, and 
Thomas Zeugmann. 

Colleagues and friends from all over the world took the opportunity of meet- 
ing for this conference to celebrate Arikawa’s 60th birthday and to pay tribute 
to his manifold contributions to science, in general, and to Learning Theory and 
Discovery Science, in particular. 

Algorithmic Learning Theory (ALT, for short) is another conference series 
initiated by Setsuo Arikawa in Japan in 1990. In 1994, it amalgamated with the 
conference series on Analogical and Inductive Inference (AII), when ALT was
held outside of Japan for the first time. 

This year, ALT 2001 and DS 2001 were co-located in Washington D.C., held 
in parallel and sharing five invited talks and all social events. The proceedings 
of ALT 2001 are published as a twin volume of the present one as LNAI 2225. 

The present volume is organized in three parts. The first part contains the 
five invited lectures of ALT 2001 and DS 2001 exactly in the order in which they 
appeared in the conferences’ common advance program. The invited speakers are 
Setsuo Arikawa, Lindley Darden, Dana Angluin, Ben Shneiderman, and Paul R. 
Cohen. Because their talks were invited to both conferences, a modus vivendi
for publication had to be found. This volume contains the full versions
of Lindley Darden’s and Ben Shneiderman’s papers as well as abstracts of the
others.

The second part contains the 30 accepted regular papers of the DS 2001
conference. Last but not least, there is a third part with written versions of 
the posters accepted for presentation during the conference. In a sense, DS 2001 
posters are posters of ALT 2001 as well, because both events shared a conference 
venue including the exhibition area for the posters. 

The combination of ALT 2001 and DS 2001 allowed for an especially com- 
prehensive treatment of the issues ranging from rather theoretical investigations 
to applications and to both psychological and sociological topics. The organizers 
consider this an attractive approach to both communities. 

Over the past dozen or so years, many enterprises have begun to routinely 
capture enormous volumes of data describing their operations, products, services,
and customers. Simultaneously, scientists and engineers have been recording
experimental data of a continuously growing size covering experience in many
fields. The finer the measurement granularity of the engineers’ equipment and 
the more computer power available to support scientific experiments, the larger 
the amounts of data captured. These huge collections of bits and bytes constitute 
a new challenge to those who try to separate the wheat from the chaff. 

Potentially, there is more fruitful knowledge hiding in huge amounts of data, 
but combinatorially, there is even more rubbish. It requires a new dimension of 
technological investment to extract useful information, and humans must attack 
these problems differently. Discovery Science deals with all aspects of promoting 
scientific discovery, and it is changing its character within a changing world. 

New questions are being asked and leading to innovative concepts. Concep- 
tualizations are setting the stage for asking new questions. Under these cir- 
cumstances, new ways of looking at the problems might arise. More traditional 
disciplines are invoked and innovative ideas are made precise to get computers 
involved in knowledge discovery. Autonomously working machinery is necessary 
to deal with the flood of data; thus learning becomes a core technology of discovery
science. When all is said and done, humans and machines must learn together
and support each other. 

The field of Discovery Science is evolving and frequently changing its appear- 
ance. The Discovery Science conference series aims at reflecting this development, 
summarizing the state of affairs and helping humans to navigate in such an
exciting environment.
The Discovery Science Project in Japan



Setsuo Arikawa 

Department of Informatics, Kyushu University 
Fukuoka 812-8581, Japan 
arikawa@i.kyushu-u.ac.jp

Abstract. The Discovery Science project in Japan in which more than sixty 
scientists participated was a three-year project sponsored by Grant-in-Aid for 
Scientific Research on Priority Area from the Ministry of Education, Culture, 
Sports, Science and Technology (MEXT) of Japan. This project mainly aimed 
to (1) develop new methods for knowledge discovery, (2) install network envi- 
ronments for knowledge discovery, and (3) establish Discovery Science as a new 
area of Computer Science / Artificial Intelligence Study. 

In order to attain these aims we set up five groups for studying the following 
research areas: 

(A) Logic for/of Knowledge Discovery 

(B) Knowledge Discovery by Inference/Reasoning 

(C) Knowledge Discovery Based on Computational Learning Theory 

(D) Knowledge Discovery in Huge Database and Data Mining 

(E) Knowledge Discovery in Network Environments 

These research areas and related topics can be regarded as a preliminary def- 
inition of Discovery Science by enumeration. Thus Discovery Science ranges over 
philosophy, logic, reasoning, computational learning and system developments. 

In addition to these five research groups we organized a steering group for
planning, adjustment and evaluation of the project. The steering group, chaired
by the principal investigator of the project, consists of leaders of the five research
groups and their subgroups as well as advisors from outside the project.
We also invited three scientists to survey Discovery Science across the
above five research areas from the viewpoints of knowledge science, natural language
processing, and image processing, respectively.

Group A studied discovery from a very broad perspective, taking into
account the historical and social aspects of discovery as well as its computational
and logical aspects. Group B focused on the role of inference/reasoning
in knowledge discovery, and obtained many results in both theory and practice
on statistical abduction, inductive logic programming and inductive inference.
Group C aimed to propose and develop computational models and methodologies
for knowledge discovery mainly based on computational learning theory.
This group obtained some deep theoretical results on boosting of learning
algorithms and the minimax strategy for Gaussian density estimation, and also
methodologies specialized to concrete problems such as an algorithm for finding
best subsequence patterns, a biological sequence compression algorithm, text
categorization, and MDL-based compression. Group D aimed to create a computational
strategy for speeding up the discovery process as a whole. For this purpose,
group D was organized with researchers working in scientific domains and
researchers from computer science so that real issues in the discovery process
could be exposed and practical computational techniques could be devised and
tested for solving these real issues. This group handled many kinds of data: data
from national projects such as genomic data and satellite observations, data
generated from laboratory experiments, data collected from personal interests such
as literature and medical records, data collected in business and marketing
areas, and data for proving the efficiency of algorithms such as the UCI repository.
Many theoretical and practical results were obtained on this variety of data.
Group E aimed to develop a unified media system for knowledge discovery
and network agents for knowledge discovery. This group obtained practical
results on a new virtual materialization of DB records and scientific computations
that help scientists to make a scientific discovery, a convenient visualization
interface that treats web data, and an efficient algorithm that extracts important
information from semi-structured data in the web space.

This lecture describes an outline of our project and the main results as well 
as how the project was prepared. We have published and are publishing special
issues on our project in several journals [5], [6], [7], [8], [9], [10]. As an activity
of the project we organized and sponsored the Discovery Science conference for
three years, where many papers were presented by our members [2], [3], [4]. We
also published annual progress reports [1], which were distributed at the DS
conferences. We are publishing the final technical report as an LNAI volume [11].

References 

1. S. Arikawa, M. Sato, T. Sato, A. Maruoka, S. Miyano, and Y. Kanada. Discovery 
Science Progress Report No. 1 (1998), No. 2 (1999), No. 3 (2000). Department of
Informatics, Kyushu University.

2. S. Arikawa and H. Motoda. Discovery Science. LNAI, Springer 1532, 1998. 

3. S. Arikawa and K. Furukawa. Discovery Science. LNAI, Springer 1721, 1999. 

4. S. Arikawa and S. Morishita. Discovery Science. LNAI, Springer 1967, 2000. 

5. H. Motoda and S. Arikawa (Eds.) Special Feature on Discovery Science. New 
Generation Computing, 18(1): 13-86, 2000. 

6. S. Miyano (Ed.) Special Issue on Surveys on Discovery Science. IEICE Transactions
on Information and Systems, E83-D(1): 1-70, 2000.

7. H. Motoda (Ed.) Special Issue on Discovery Science. Journal of Japanese Society 
for Artificial Intelligence, 15(4):592-702, 2000. 

8. S. Morishita and S. Miyano (Eds.) Discovery Science and Data Mining (in
Japanese), bit special volume, Kyoritsu Shuppan, 2000.

9. S. Arikawa, M. Sato, T. Sato, A. Maruoka, S. Miyano, and Y. Kanada. The 
Discovery Science Project. Journal of Japanese Society for Artificial Intelligence, 
15(4):595-607, 2000.

10. S. Arikawa, H. Motoda, K. Furukawa, and S. Morishita (Eds.) Theoretical Aspects 
of Discovery Science. Theoretical Computer Science (to appear)

11. S. Arikawa and A. Shinohara (Eds.) Progresses in Discovery Science. LNAI, 
Springer (2001, to appear) 




Discovering Mechanisms: A Computational 
Philosophy of Science Perspective 



Lindley Darden 

Department of Philosophy 
University of Maryland 
College Park, MD 20742 
darden@carnap.umd.edu

http://www.inform.umd.edu/PHIL/faculty/LDarden/



Abstract. A task in the philosophy of discovery is to find reasoning 
strategies for discovery, which fall into three categories: strategies for 
generation, evaluation and revision. Because mechanisms are often what 
is discovered in biology, a new characterization of mechanism aids in 
their discovery. A computational system for discovering mechanisms is 
sketched, consisting of a simulator, a library of mechanism schemas and 
components, and a discoverer for generating, evaluating and revising pro- 
posed mechanism schemas. Revisions go through stages from how possi- 
bly to how plausibly to how actually. 



1 Introduction 

Philosophers of discovery look for reasoning strategies that can guide discovery. 
This work is in the framework of Herbert Simon’s (1997) view of discovery as 
problem solving. Given a problem to be solved, such as explaining a phenomenon, 
one goal is to find a mechanism that produces that phenomenon. For example, 
given the phenomenon of the production of a protein, the goal is to find the 
mechanism of protein synthesis. The task of the philosopher of discovery is to 
find reasoning strategies to guide such discoveries. Strategies are heuristics for 
problem solving; that is, they provide guidance but do not guarantee success. 

Discovery is not viewed as something that occurs in a single a-ha moment of 
insight. Instead, discovery is construed as a process that occurs over an extended 
period of time, going through cycles of generation, evaluation, and revision (Dar- 
den 1991). 

The history of science is a source of “compiled hindsight” (Darden 1987) 
about reasoning strategies for discovering mechanisms. This paper will use ex- 
amples from the history of biology to illustrate general reasoning strategies for 
discovering mechanisms. Section 2 puts this work into the broader context of 
a matrix of biological knowledge. Section 3 discusses a new characterization of 
mechanism, based on an ontology of entities, properties, and activities. Section 
4 outlines components of a mechanism discovery system, including a simulator, 
a library of mechanism designs and components, and a discoverer. 



Fig. 1. Matrix of Biological Knowledge 



2 Biomatrix 

This work is situated in a larger framework. In the 1980s, Harold Morowitz 
(1985) chaired a National Academy of Sciences workshop on models in biology. 
As a result of that workshop, a society was formed with the name, “Biomatrix: 
A Society for Biological Computation and Informatics” (Morowitz and Smith 
1987). This society was ahead of its time; it has splintered into different groups 
and its grand vision has yet to be realized. Nonetheless, its vision is worth revisit- 
ing in order to put the work to be discussed in this paper into a broader context. 
As Figure 1 shows, the biomatrix vision included relations among three areas: 
first, databases; second, information storage and retrieval by literature cataloging 
(e.g., Medline); and, third, artificial intelligence and knowledge bases. Discovery 
science has worked in all three areas since the 1980s. Knowledge discovery in 
databases is a booming area (e.g., Piatetsky-Shapiro and Frawley, eds., 1991). 
Discovery using abstracts available from literature catalogues has been developed
by Don Swanson (1990) and others. The area of discovery using knowledge based 
systems is an active area, especially in computational biology. The meetings on 
Intelligent Systems in Molecular Biology and the International Society for Com- 
putational Biology arose from that part of the biomatrix. It is in the knowledge 
based systems box that my work today will fall. Relations to databases and in- 
formation retrieval as related to mechanism discovery will perhaps occur to the 
reader. 



3 Mechanisms, Schemas, and Sketches 

Often in biology, what is to be discovered is a mechanism. Physicists often aim 
to discover general laws, such as Newton’s laws of motion. However, few biolog- 
ical phenomena are best characterized by universal, mathematical laws (Beatty 
1995). The field of molecular biology, for example, studies mechanisms, such as 
the mechanisms of DNA replication, protein synthesis, and gene regulation. The 
lively area of functional genomics is now attempting to discover the mechanisms 
in which the gene sequences act. Such mechanisms include gene expression, dur- 
ing both embryological development and normal gene activities in the adult. 
The field of biochemistry also studies mechanisms when it finds the activities
that transform one stage in a pathway to the next, such as the enzymes, reactants
and products in the Krebs cycle that produces the energy molecule ATP. An 
important current scientific task is to connect genetic mechanisms studied by 
molecular biology with metabolic mechanisms studied by biochemistry. As that 
task is accomplished, science will have a unified picture of the mechanisms that 
carry out the two essential features of life according to Aristotle: reproduction 
and nutrition. 

Given this importance of mechanisms in biology, a correspondingly important 
task for discovery science is to find methods for discovering mechanisms. If the 
goal is to discover a mechanism, then the nature of that product shapes the 
process of discovery. A new characterization of mechanism aids the search for 
reasoning strategies to discover mechanisms. 

A mechanism is sought to explain how a phenomenon is produced 
(Machamer, Darden, Craver 2000) or how some task is carried out (Bechtel and
Richardson 1993) or how the mechanism as a whole behaves (Glennan 1996). 
Mechanisms may be characterized in the following way: 

Mechanisms are entities and activities organized such that they are productive
of regular changes from start or set-up to finish or termination
conditions. (Machamer, Darden, Craver 2000, p. 3).

Mechanisms are regular in that they usually work in the same way under the 
same conditions. The regularity is exhibited in the typical way that the mecha- 
nism runs from beginning to end; what makes it regular is the productive con- 
tinuity between stages. Mechanisms exhibit productive continuity without gaps 
from the set up to the termination conditions; that is, each stage gives rise to,
allows, drives, or makes the next. 

The ontology proposed here consists of entities, properties, and activities. 
Mechanisms are composed of both entities (with their properties) and activities. 
Activities are the producers of change. Entities are the things that engage in 
activities. Activities require that entities have specific types of properties. For 
example, two entities, a DNA base and its complement, engage in the activity 
of hydrogen bonding because of their properties of geometric shape and weak 
polar charges. 
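
As a rough illustration, the ontology described above might be encoded as follows. This is only a minimal sketch and not part of the system proposed in this paper: the class names `Entity` and `Activity`, the property labels, and the `can_engage` check are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A thing that engages in activities; properties are plain labels."""
    name: str
    properties: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Activity:
    """A producer of change that requires specific entity properties."""
    name: str
    required_properties: frozenset = field(default_factory=frozenset)

    def can_engage(self, *entities: Entity) -> bool:
        # Every participating entity must carry all required properties.
        return all(self.required_properties <= e.properties for e in entities)


# Hypothetical property labels, chosen only to mirror the example in the text.
PAIRING = frozenset({"geometric shape", "weak polar charge"})
base = Entity("DNA base", PAIRING)
complement = Entity("complementary base", PAIRING)
ion = Entity("ion", frozenset({"charge"}))

hydrogen_bonding = Activity("hydrogen bonding", PAIRING)

print(hydrogen_bonding.can_engage(base, complement))  # True
print(hydrogen_bonding.can_engage(base, ion))         # False
```

The subset test is the simplest way to express that an activity requires its entities to have specific types of properties; a fuller treatment would need richer property descriptions than string labels.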

For a given scientific field, there are typically entities and activities that are 
accepted as relatively fundamental or taken to be unproblematic for the purposes 
of a given scientist, research group, or field. That is, descriptions of mechanisms 
in that field typically bottom out somewhere. Bottoming out is relative: different 
types of entities and activities are where a given field stops when constructing its 
descriptions of mechanisms. In molecular biology, mechanisms typically bottom 
out in descriptions of the activities of cell organelles, such as the ribosome, 
and molecules, including macromolecules, smaller molecules, and ions. The most 
important kinds of activities in molecular biology are geometrico-mechanical 
and electro-chemical activities. An example of a geometrico-mechanical activity 
is the lock and key docking of an enzyme and its substrate. Electro-chemical 
activities include strong covalent bonding and weak hydrogen bonding. 

Entities and activities are interdependent (Machamer, Darden, Craver 2000,
p. 6). For example, appropriate chemical valences are necessary for covalent 
bonding. Polar charges are necessary for hydrogen bonding. Appropriate shapes 
are necessary for lock and key docking. This interdependence of entities and 
activities allows one to reason about one, based on what is known or conjectured 
about the other, in each stage of the mechanism (Darden and Craver, in press).

A mechanism schema is a truncated abstract description of a mechanism that 
can be filled with more specific descriptions of component entities and activities. 
An example is the following: 

DNA → RNA → protein.

This is a diagram of the central dogma of molecular biology. It is a very abstract, 
schematic representation of the mechanism of protein synthesis. 

A schema may be even more abstract if it merely indicates functional roles 
played in the mechanism by fillers of a place in the schema (Craver 2001). Consider
the schema

DNA → template → protein.

The schema term “template” indicates the functional role played by the interme- 
diate between DNA and protein. Hypotheses about role-fillers changed during 
the incremental discovery of the mechanism of protein synthesis in the 1950s 
and 1960s. Thus, mechanism schemas are particularly good ways of representing
functional roles. (For discussion of “local” and “integrated” functions and a less 
schematic way of representing them in a computational system, see Karp 2000.) 

Table 1. Constraints on the Organization of Mechanisms

Character of phenomenon
Componency Constraints
    Entities and activities
    Modules
Spatial Constraints
    Compartmentalization
    Localization
    Connectivity
    Structural
    Orientation
Temporal Constraints
    Order
    Rate
    Duration
    Frequency
Hierarchical Constraints
    Integration of levels

(from Craver and Darden 2001)



Mechanism sketches are incomplete schemas. They contain black boxes, 
which cannot yet be filled with known components. Attempts to instantiate 
a sketch would leave a gap in the productive continuity; that is, knowledge of 
the needed particular entities and activities is missing. Thus, sketches indicate 
what needs to be discovered in order to find a mechanism schema. 
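
The relation between a sketch and a schema can be made concrete with a small example. The following is a minimal sketch under simplifying assumptions: a schema is modelled as an ordered list of stage fillers, `None` stands for a black box, and the function names `black_boxes` and `instantiate` are hypothetical.

```python
from typing import Dict, List, Optional

# A schema is modelled as an ordered list of stage fillers; None marks a black box.
Schema = List[Optional[str]]


def black_boxes(schema: Schema) -> List[int]:
    """Return the positions whose fillers are still unknown."""
    return [i for i, filler in enumerate(schema) if filler is None]


def instantiate(schema: Schema, fillers: Dict[int, str]) -> Schema:
    """Fill black boxes by position, leaving known stages untouched."""
    return [fillers.get(i) if stage is None else stage
            for i, stage in enumerate(schema)]


sketch = ["DNA", None, "protein"]          # intermediate not yet known
print(black_boxes(sketch))                 # [1]

schema = instantiate(sketch, {1: "RNA"})   # hypothesized role filler
print(schema)                              # ['DNA', 'RNA', 'protein']
print(black_boxes(schema))                 # []
```

A non-empty result from `black_boxes` corresponds to a gap in productive continuity, that is, to something that still needs to be discovered.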

Once a schema is found and instantiated, a detailed description of a mecha- 
nism results. For example, a more detailed description of the protein synthesis 
mechanism (often depicted in diagrams) satisfies the constraints that any ade- 
quate description of a mechanism must satisfy. It shows how the phenomenon, 
the synthesis of a protein, is carried out by the operation of the mechanism. 
It depicts the entities (DNA, RNA, and amino acids) as well as, implicitly, the
activities. Hydrogen bonding is the activity operating when messenger RNA is 
copied from DNA. There is a geometrico-mechanical docking of the messenger 
RNA and the ribosome, a particle in the cytoplasm. Hydrogen bonding again 
occurs as the codons on messenger RNA bond to the anticodons on transfer 
RNAs carrying amino acids. Finally, covalent bonding is the activity that links 
the amino acids together in the protein. Good mechanism descriptions show the 
spatial relations of the components and the temporal order of the stages. 

A detailed description of a mechanism satisfies several general constraints. 
(They are listed in Table 1 and indicated here by italics.) There is a phenomenon 
that the mechanism, when working, produces, for example, the synthesis of a
protein. The nature of the phenomenon, which may be recharacterized as research
on it proceeds, constrains details about the mechanism that produces it. For 
example, the components of the mechanism, the entities and activities, must be 
adequate to synthesize a protein, composed of amino acids tightly covalently 
bonded to each other. There are various spatial constraints. The DNA is located 
in the nucleus (in eucaryotes) and the rest of the machinery is in the cytoplasm. 
The ribosome is a particle with a two part structure that allows it to attach to 
the messenger RNA and orient the codons of the messenger so that particular 
transfer RNAs can hydrogen bond to them. There is a particular order in which 
the steps occur and they take certain amounts of time. All of these constraints 
can play roles in the search for mechanisms, and, then, they become part of an 
adequate description of a mechanism. (For more discussion of these constraints, 
see Craver and Darden 2001.)

From this list of constraints on an adequate description of a mechanism, it is 
evident that mere equations do not adequately represent the numerous features 
of a mechanism, especially spatial constraints. Diagrams that depict structural 
features, spatial relations and temporal sequences are good representations of 
mechanisms. 

To sum up so far: Recent work has provided this new characterization of what 
a mechanism is, the constraints that any adequate description of a mechanism 
must satisfy, and an analysis of abstract mechanism schemas and incomplete 
mechanism sketches that can play roles in guiding discovery. 



4 Outline of a System for Constructing Hypothetical 
Mechanisms 

Components of a computational system for discovering mechanisms are outlined 
in Figure 2. They include a simulator, a hypothesized mechanism schema, a 
discoverer with reasoning strategies for generation, evaluation, and revision, and 
a searchable, indexed library. 



4.1 Simulator 

The goal is to construct a simulator that adequately simulates a biological mech- 
anism. Given the set up conditions, the simulator can be used to predict specific 
termination conditions. The simulator is an instantiation of a mechanism schema. 
It may contain more or less detail about the specific component entities and ac- 
tivities and their structural, spatial and temporal organization. From a human 
factors perspective, a video option to display the mechanism simulation in action 
would aid the user in seeing what the mechanism is doing at each stage. The 
video could be stopped at each stage and details of the entities and activities of 
that stage examined in more detail. 
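
The idea of a simulator that runs from set up conditions to termination conditions, and that can be stopped at each stage, might be sketched as follows. This is a toy illustration only: the state is a dictionary of molecule counts, and the stage functions `transcribe` and `translate` are hypothetical stand-ins for real activities.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Stage = Tuple[str, Callable[[State], State]]


def simulate(setup: State, stages: List[Stage]) -> List[Tuple[str, State]]:
    """Run each stage in order and record the state after it."""
    trace = [("set up", dict(setup))]
    state = dict(setup)
    for name, activity in stages:
        state = activity(state)
        trace.append((name, dict(state)))
    return trace


# Toy stages operating on molecule counts, purely illustrative.
def transcribe(s: State) -> State:
    return {**s, "mRNA": s.get("mRNA", 0) + s["DNA"]}


def translate(s: State) -> State:
    return {**s, "protein": s.get("protein", 0) + s["mRNA"]}


trace = simulate({"DNA": 1},
                 [("transcription", transcribe), ("translation", translate)])
for name, state in trace:
    print(name, state)
```

Keeping the whole trace, rather than only the final state, is what allows each stage to be inspected in the way the text describes for the video display.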

Fig. 2. Outline for a Mechanism Discovery System



4.2 Library 

A mechanism schema is discovered by iterating through stages of generation, 
evaluation, and revision. Generation is accomplished by several steps. First, a 
phenomenon to be explained must be characterized. Its mode of description will 
guide the search for schemas that can produce it. Search occurs within a library, 
consisting of several types of entries: types of schemas, types of modules, types 
of entities, and types of activities. 

The search among types of schemas is a search for an abstraction of an 
analogous mechanism (on analogies and schemas, see, e.g., Holyoak and Thagard 
1995). Kevin Dunbar (1995) has shown that molecular biologists often use “local 
analogies” to similar mechanisms in their own field and “regional analogies” to 
mechanisms in other, neighboring fields. Such analogies are good sources from 
which to abstract mechanism schemas. 

Types of schemas, modules, entities and activities are interconnected. A par- 
ticular type of schema, for example, a gene regulation schema, may suggest one 
or more types of modules, such as derepression or negative feedback modules. A 
type of entity will have activity-enabling properties that indicate it can produce 
a type of activity. Conversely, a type of activity will require particular types of 
entities. For example, nucleic acids have polar charged bases that enable them 
to engage in the activity of hydrogen bonding, a weak form of chemical bonding 
that can be easily formed and broken between polar molecules. 

Schemas may be indexed by the kind of phenomenon they produce. For ex- 
ample, for the phenomenon of producing an adaptation, two types of mechanisms 
have been proposed historically by biologists: selective mechanisms and instructive
mechanisms (Darden, 1987). At a high degree of abstraction, a selection
schema may be characterized as follows: first comes a stage of variant production;
next comes a stage with a selective interaction that poses a challenge to
the variants; this is followed by differential benefit for some of the variants. In 
contrast, instructive mechanisms have a coupling between the stage of variant 
production and the selective environment, so that an instruction is sent from 
the environment and interpreted by the adaptive system to produce only the 
required variant. In evolutionary biology and immunology, selective mechanisms 
rather than instructive ones have been shown to work in producing evolutionary 
adaptations and clones of antibody cells (Darden and Cain 1989). 

A library of modules can be indexed by the functional roles they can fulfill 
in a schema (e.g., Goel and Chandrasekaran 1989). For example, if a schema re- 
quires end-product inhibition, then a feedback control module can be added to 
the linear schema. If cell-to-cell signaling is indicated, then membrane spanning 
proteins serving as receptors are a likely kind of module. Entities and activities 
can be categorized in numerous ways. Types of macromolecules include nucleic 
acids, proteins, and carbohydrates. When proteins, for example, perform func- 
tions, such as enzymes that catalyze reactions, then the kind of function, such 
as phosphorylation, is a useful indexing method. 
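
The two kinds of indexing described in this section could be realized with ordinary lookup tables. The following is a minimal sketch, assuming a hypothetical `Library` class with string keys; the entries are taken from the examples in the text, but the class and method names are illustrative only.

```python
from collections import defaultdict


class Library:
    """Toy library: entries are retrievable by the key they are indexed under."""

    def __init__(self):
        self._by_phenomenon = defaultdict(list)   # phenomenon -> schema types
        self._by_role = defaultdict(list)         # functional role -> module types

    def add_schema(self, phenomenon, schema_type):
        self._by_phenomenon[phenomenon].append(schema_type)

    def add_module(self, role, module_type):
        self._by_role[role].append(module_type)

    def schemas_for(self, phenomenon):
        return list(self._by_phenomenon.get(phenomenon, []))

    def modules_for(self, role):
        return list(self._by_role.get(role, []))


library = Library()
library.add_schema("adaptation", "selective mechanism")
library.add_schema("adaptation", "instructive mechanism")
library.add_module("end-product inhibition", "feedback control module")
library.add_module("cell-to-cell signaling", "membrane spanning receptor protein")

print(library.schemas_for("adaptation"))
print(library.modules_for("end-product inhibition"))
print(library.schemas_for("unknown phenomenon"))   # [] when nothing is indexed
```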



4.3 Discoverer: Generation, Evaluation, Revision 

During generation, after a phenomenon is characterized, then a search is made to 
see if an entire schema can be found that produces such a type of phenomenon. 
If an entire schema can be found, such as a selective or instructive schema, then 
generation can proceed to further specification with types of modules, entities, 
and activities. If no entire schema is available, then modules may be put together 
piecemeal to fulfill various functional roles. If functional roles and modules to 
fill them are not yet known, then reasoning about types of entities and activities 
is available. By starting from known set up conditions, or, conversely, from the 
end product, a hypothesized string of entities and activities can be constructed. 
Reasoning forward from the beginning or backward from the end product of the 
mechanism will allow gaps in the middle to be filled. In sum, reasoning strategies 
for discovering mechanisms include schema instantiation, modular subassembly, 
and forward chaining/backtracking (Darden, forthcoming). 
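
Reasoning forward from the set up conditions or backward from the end product can be illustrated with a simple search. This is a sketch under strong assumptions: each activity type is reduced to a hypothetical (input, output) pair of entity types, and breadth-first search is used here for convenience rather than because the text prescribes it.

```python
from collections import deque

# Hypothetical activity types: name -> (input entity type, output entity type).
ACTIVITIES = {
    "transcription": ("DNA", "mRNA"),
    "translation": ("mRNA", "protein"),
    "replication": ("DNA", "DNA"),
}


def forward_chain(start, goal, activities=ACTIVITIES):
    """Search from the set-up entity toward the end product.

    Returns the list of activity names linking start to goal, or None
    if no gap-free chain exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal and path:
            return path
        for name, (src, dst) in activities.items():
            if src == entity and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [name]))
    return None


def backward_chain(start, goal, activities=ACTIVITIES):
    """Search from the end product back toward the set-up entity."""
    reversed_acts = {n: (dst, src) for n, (src, dst) in activities.items()}
    path = forward_chain(goal, start, reversed_acts)
    return None if path is None else path[::-1]


print(forward_chain("DNA", "protein"))    # ['transcription', 'translation']
print(backward_chain("DNA", "protein"))   # ['transcription', 'translation']
print(forward_chain("protein", "DNA"))    # None
```

A result of `None` signals that the known activity types leave a gap, which is the situation in which a black box remains in the sketch.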



Evaluation. Once one or more hypothesized mechanism schemas are found 
or constructed piecemeal, then evaluation occurs. Evaluation proceeds through 
stages from how possibly to how plausibly to how actually. (Peter Machamer 
suggests that “how actually” is best read as “how most plausibly,” given that 
all scientific claims are contingent, that is, subject to revision in the light of new 
evidence.) 

How possibly a mechanism operates can be shown by building a simulator 
that begins with the set up conditions and produces the termination conditions 
by moving through hypothesized intermediate stages. As additional constraints 
are fulfilled and evaluation strategies applied, the proposed mechanism becomes
more plausible. The constraints of Table 1 must be satisfied. Table 2 (from
Darden 1991, Table 15-2) lists strategies for theory assessment often employed by
philosophers of science. A working simulator will likely show that the proposed
schema is internally consistent and consists of modules whose functioning is
clearly understood, thus satisfying some of the conditions listed in 1-3. If the
simulator can be run to produce the phenomenon to be explained, then condition
4 of explanatory adequacy is at least partially fulfilled. Testing a prediction
against data is often viewed as the most important evaluation strategy. The
simulator can be run with different initial conditions to produce predictions,
which can be tested against data. If a prediction does not match a data point,
then an anomaly results and revision is required. We will omit further discussion
of the other strategies for theory assessment in order to turn our attention to
anomaly resolution strategies to use when revision is required.

Table 2. Strategies for Theory Evaluation

1. Internally consistent and nontautologous
2. Systematicity vs. modularity
3. Clarity
4. Explanatory adequacy
5. Predictive adequacy
6. Scope and generality
7. Lack of ad hocness
8. Extendability and fruitfulness
9. Relations with other accepted theories
10. Metaphysical and methodological constraints
11. Relation to rivals

(from Darden 1991, p. 257)



Anomaly resolution. When a prediction does not match a data point, then 
an anomaly results. Strategies for anomaly resolution require a number of infor- 
mation processing tasks to be carried out. In previous work with John Josephson 
and Dale Moberg, we investigated computational implementation of such tasks 
(Moberg and Josephson 1990; Darden et al. 1992; Darden 1998). A list of such 
tasks is found in Figure 3. 

Reasoning in anomaly resolution is, first, a diagnostic reasoning task, to lo- 
calize the site(s) of failure, and, then, a redesign task, to improve the simulation 
to remove the problem. Characterizing the exact difference between the predic- 
tion and the data point is a first step. Peter Karp (1990; 1993) discussed this step 
of anomaly resolution in his implementation of the MOLGEN system to resolve 
anomalies in a molecular biology simulator. One wants to milk the anomaly itself 
for all the information one can get about the nature of the failure. Often during 
diagnosis, the nature of the anomaly allows failures to be localized to one part 
of the system rather than others, sometimes to a specific site. 
[Fig. 3: flow diagram; layout not recoverable. Surviving labels: simulator + 
initial conditions; observed result; additional information (e.g., from research 
program); construct alternative redesign hypotheses; test modified theory 
(resimulate); retry redesign; additional information (e.g., relation to other 
theories?); incorporate in explanatory repertoire.] 

Fig. 3. Information Processing Tasks in Anomaly Resolution (from Darden 1998, p. 69) 

Once hypothesized localizations are found by doing credit assignment, then 
alternative redesign hypotheses for that module can be constructed. Once again, 
the library of modules, entities and activities can be consulted to find plausi- 
ble candidates. The newly redesigned simulator can be run again to see if the 
problem is fixed and the prediction now matches the data point. 

5 Piecemeal Discovery and Hierarchical Integration 

The view of scientific discovery proposed here is that discovery of mechanisms 
occurs in extended episodes of cycles of generation, evaluation, and revision. In 
so far as the constraints are satisfied, the assessment strategies are applied, and 
any anomalies are resolved, then the hypothesized mechanism will have moved 
through the stages of how possibly to how plausibly to how actually. A new 
mechanism will have been discovered. 

Once a new mechanism at a given mechanism level has been discovered, then 
that mechanism needs to be situated within the context of other biological mech- 
anisms. Thus, the general strategy for theory evaluation of consistent relations 
with other accepted theories in other fields of science (see Table 2, strategy 9) is 
reinterpreted. By thinking about theories as mechanism schemas, the strategy 
gets implemented by situating the hypothesized mechanism in a larger context. 
This larger context consists of mechanisms that occur before and after it, as well 
as mechanisms up or down in a mechanism hierarchy (Craver 2001). Biological 
mechanisms are nested within other mechanisms, and finding such a fit in 
an integrated picture is another measure of the adequacy of a newly proposed 
mechanism. 

6 Conclusion 

Integrated mechanism schemas can serve as the scaffolding of the biological 
matrix. They provide a framework to integrate general biological knowledge 
of mechanisms, the data that provide evidence for such mechanisms, and the 
reports in the literature of research to discover mechanisms. 

This paper has discussed a new characterization of mechanism, based on 
an ontology of entities, properties, and activities, and has outlined components 
of a computational system for discovering mechanisms. Discovery is viewed as 
an extended process, requiring reasoning strategies for generation, evaluation, 
and revision of hypothesized mechanism schemas. Discovery moves through the 
stages from how possibly to how plausibly to how actually a mechanism works. 
Abstract. We begin with a brief tutorial on the problem of learning a finite 
concept class over a finite domain using membership queries and/or 
equivalence queries. We then sketch general results on the number of 
queries needed to learn a class of concepts, focusing on the various no- 
tions of combinatorial dimension that have been employed, including the 
teaching dimension, the exclusion dimension, the extended teaching di- 
mension, the fingerprint dimension, the sample exclusion dimension, the 
Vapnik-Chervonenkis dimension, the abstract identification dimension, 
and the general dimension. 
Abstract. The growing use of information visualization tools and data mining 
algorithms stems from two separate lines of research. Information visualization 
researchers believe in the importance of giving users an overview and insight 
into the data distributions, while data mining researchers believe that statistical 
algorithms and machine learning can be relied on to find the interesting patterns. 
This paper discusses two issues that influence design of discovery tools: statistical 
algorithms vs. visual data presentation, and hypothesis testing vs. exploratory 
data analysis. I claim that a combined approach could lead to novel discovery 
tools that preserve user control, enable more effective exploration, and promote 
responsibility. 



1 Introduction 



Genomics researchers, financial analysts, and social scientists hunt for patterns in vast 
data warehouses using increasingly powerful software tools. These tools are based on 
emerging concepts such as knowledge discovery, data mining, and information visualization. 
They also employ specialized methods such as neural networks, decision trees, 
principal components analysis, and a hundred others. 

Computers have made it possible to conduct complex statistical analyses that would 
have been prohibitive to carry out in the past. However, the dangers of using complex 
computer software grow when user comprehension and control are diminished. There- 
fore, it seems useful to reflect on the underlying philosophy and appropriateness of the 
diverse methods that have been proposed. This could lead to better understandings of 
when to use given tools and methods, as well as contribute to the invention of new 
discovery tools and refinement of existing ones. 

Each tool conveys an outlook about the importance of human initiative and control 
as contrasted with machine intelligence and power [16]. The conclusion deals with the 
central issue of responsibility for failures and successes. Many issues influence design 
of discovery tools, but I focus on two: statistical algorithms vs. visual data presentation 
and hypothesis testing vs. exploratory data analysis. 

* Keynote for Discovery Science 2001 Conference, November 25-28, 2001, Washington, DC. 

Also to appear in Information Visualization, new journal by Palgrave/MacMillan. 
2 Statistical Algorithms vs. Visual Data Presentation 

Early efforts to summarize data generated means, medians, standard deviations, and 
ranges. These numbers were helpful because their compactness, relative to the full data 
set, and their clarity supported understanding, comparisons, and decision making. Sum- 
mary statistics appealed to the rational thinkers who were attracted to the objective nature 
of data comparisons that avoided human subjectivity. However, they also hid interesting 
features such as whether distributions were uniform, normal, skewed, bi-modal, or dis- 
torted by outliers. A remedy to these problems was the presentation of data as a visual 
plot so interesting features could be seen by a human researcher. 

The invention of time-series plots and statistical graphics for economic data is 
usually attributed to William Playfair (1759-1823) who published The Commercial and 
Political Atlas in 1786 in London. Visual presentations can be very powerful in revealing 
trends, highlighting outliers, showing clusters, and exposing gaps. Visual presentations 
can give users a richer sense of what is happening in the data and suggest possible 
directions for further study. Visual presentations speak to the intuitive side and the 
sense-making spirit that is part of exploration. Of course visual presentations have their 
limitations in terms of dealing with large data sets, occlusion of data, disorientation, and 
misinterpretation. 

By early in the 20th century statistical approaches, encouraged by the Age of Ra- 
tionalism, became prevalent in many scientific domains. Ronald Fisher (1890-1962) 
developed modern statistical methods for experimental designs related to his extensive 
agricultural studies. His development of analysis of variance for design of factorial ex- 
periments [7] helped advance scientific research in many fields [12]. His approaches 
are still widely used in cognitive psychology and have influenced most experimental 
sciences. 

The appearance of computers heightened the importance of this issue. Computers 
can be used to carry out far more complex statistical algorithms and they can also be used to 
generate rich visual, animated, and user-controlled displays. Typical presentation of sta- 
tistical data mining results is by brief summary tables, induced rules, or decision trees. 
Typical visual data presentations show data-rich histograms, scattergrams, heatmaps, 
treemaps, dendrograms, parallel coordinates, etc. in multiple coordinated windows that 
support user-controlled exploration with dynamic queries for filtering (Fig. 1). Comparative 
studies of statistical summaries and visual presentations demonstrate the importance 
of user familiarity and training with each approach and the influence of specific tasks. 
Of course, statistical summaries and visual presentations can both be misleading or 
confusing. 
[Fig. 1 caption fragment:] ... dramatic outliers: radon and helium. 

An example may help clarify the distinction. Promoters of statistical methods may 
use linear correlation coefficients to detect relationships between variables, which works 
wonderfully when there is a linear relationship between variables and when the data is 
free from anomalies. However, if the relationship is quadratic (or exponential, sinusoidal, 
etc.) a linear algorithm may fail to detect the relationship. Similarly if there are data 
collection problems that add outliers or if there are discontinuities over the range (e.g. 
freezing or boiling points of water), then linear correlation may fail. A visual presentation 
is more likely to help researchers find such phenomena and suggest richer hypotheses. 
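
The point about linear correlation can be checked numerically. The following is a 
minimal sketch, not taken from the paper: the helper pearson_r and the synthetic data 
are assumptions made for illustration. It computes the linear correlation coefficient 
for a straight-line relationship and for a quadratic one sampled symmetrically around 
zero. 

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data (illustrative only): x runs from -5.0 to 5.0 in steps of 0.1.
xs = [x / 10 for x in range(-50, 51)]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x * x for x in xs]    # y is fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)
print(f"linear relationship:    r = {r_linear:.3f}")
print(f"quadratic relationship: r = {r_quadratic:.3f}")
```

The quadratic series is completely determined by x, yet its coefficient is close to 
zero because the rising and falling halves of the parabola cancel. A scatter plot of 
the same data would show the parabola at once. 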



3 Hypothesis Testing vs. Exploratory Data Analysis 



Fisher’s approach not only promoted statistical methods over visual presentations, but 
also strongly endorsed theory-driven hypothesis-testing research over casual observation 
and exploratory data analysis. This philosophical strand goes back to Francis Bacon 
(1561-1626) and later to John Herschel’s 1830 A Preliminary Discourse on the Study 
of Natural Philosophy. They are usually credited with influencing modern notions of 
scientific methods based on rules of induction and the hypothetico-deductive method. 
Believers in scientific methods typically see controlled experiments as the fast path to 
progress, even though their use of the reductionist approach to test one variable at a time 
can be disconcertingly slow. Fisher’s invention of factorial experiments helped make 
controlled experimentation more efficient. 

Advocates of the reductionist approach and controlled experimentation argue that 
large benefits come when researchers are forced to clearly state their hypotheses in ad- 
vance of data collection. This enables them to limit the number of independent variables 
and to measure a small number of dependent variables. They believe that the courageous 
act of stating hypotheses in advance sharpens thinking, leads to more parsimonious data 
collection, and encourages precise measurement. Their goals are to understand causal 
relationships, to produce replicable results, and to emerge with generalizable insights. 
Critics complain that the reductionist approach, with its laboratory conditions to ensure 
control, is too far removed from reality (not situated and therefore stripped of context) 
and therefore may ignore important variables that affect outcomes. They also argue that 
by forcing researchers to state an initial hypothesis, their observation will be biased to- 
wards finding evidence to support their hypothesis and will ignore interesting phenomena 
that are not related to their dependent variables. 

On the other side of this interesting debate are advocates of exploratory data analysis 
who believe that great gains can be made by collecting voluminous data sets and then 
searching for interesting patterns. They contend that statistical analyses and machine 
learning techniques have matured enough to reveal complex relationships that were not 
anticipated by researchers. They believe that a priori hypotheses limit research and are no 
longer needed because of the capacity of computers to collect and analyze voluminous 
data. Skeptics worry that any given set of data, no matter how large, may still be a special 
case, thereby undermining the generalizability of the results. They also question whether 
detection of strong statistical relationships can ever lead to an understanding of cause 
and effect. They declare that correlation does not imply causation. 
Once again, an example may clarify this issue. If a semiconductor fabrication facility 
is generating a high rate of failures, promoters of hypothesis testing might list the possible 
causes, such as contaminants, excessive heat, or too rapid cooling. They might seek 
evidence to support these hypotheses and maybe conduct trial runs with the equipment 
to see if they could regenerate the problem. Promoters of exploratory data analysis might 
want to collect existing data from the past year of production under differing conditions 
and then run data mining tools against these data sets to discover correlates of high rates 
of failure. Of course, an experienced supervisor may blend these approaches, gathering 
exploratory hypotheses from the existing data and then conducting confirmatory tests. 



4 The New Paradigms 

The emergence of the computer has shaken the methodological edifice. Complex statistical 
calculations and animated visualizations have become feasible. Elaborate controlled 
experiments can be run hundreds of times and exploratory data analysis has become 
widespread. Devotees of hypothesis-testing have new tools to collect data and prove 
their hypotheses. T-tests and analysis of variance (ANOVA) have been joined by linear 
and non-linear regression, complex forecasting methods, and discriminant analysis. 

Those who believe in exploratory data analysis methods have even more new tools 
such as neural networks, rule induction, a hundred forms of automated clustering, and 
even more machine learning methods. These are often covered in the rapidly growing 
academic discipline of data mining [6,8]. Witten and Frank define data mining as "the 
extraction of implicit, previously unknown, and potentially useful information from 
data." They caution that "exaggerated reports appear of the secrets that can be uncovered 
by setting learning algorithms loose on oceans of data. But there is no magic in machine 
learning, no hidden power, no alchemy. Instead there is an identifiable body of simple 
and practical techniques that can often extract useful information from raw data." [19] 

Similarly, those who believe in data or information visualization are having a great 
time as the computer enables rapid display of large data sets with rich user control 
panels to support exploration [5]. Users can manipulate up to a million data items 
with 100-millisecond update of displays that present color-coded, size-coded markers 
for each item. With the right coding, human pre-attentive perceptual skills enable users 
to recognize patterns, spot outliers, identify gaps, and find clusters in a few hundred 
milliseconds. When data sets grow past a million items and cannot be easily seen on 
a computer display, users can extract relevant subsets, aggregate data into meaningful 
units, or randomly sample to create a manageable data set. 

The commercial success of tools such as SAS JMP (www.sas.com), SPSS Diamond 
(www.spss.com), and Spotfire (www.spotfire.com) (Fig. 1), especially for pharmaceutical 
drug discovery and genomic data analysis, demonstrates the attraction of visualization. 
Other notable products include Inxight’s Eureka (www.inxight.com) for multidimen- 
sional tabular data and Visual Insights’ eBizinsights (www.visualinsights.com) for web 
log visualization. 

Spence characterizes information visualization with this vignette: "You are the owner 
of some numerical data which, you feel, is hiding some fundamental relation... you then 
glance at some visual presentation of that data and exclaim ’Ah ha! - now I understand.’" 
[13]. 

B. Shneiderman 

But Spence also cautions that "information visualization is characterized by so 
many beautiful images that there is a danger of adopting a ’Gee Whiz’ approach to its 
presentation." 

5 A Spectrum of Discovery Tools 

The happy resolution to these debates is to take the best insights from both extremes and 
create novel discovery tools for many different users and many different domains. Skilled 
problem solvers often combine observation at early stages, which leads to hypothesis- 
testing experiments. Alternatively they may have a precise hypothesis, but if they are 
careful observers during a controlled experiment, they may spot anomalies that lead 
to new hypotheses. Skilled problem solvers often combine statistical tests and visual 
presentation. A visual presentation of data may identify two clusters whose separate 
analysis can lead to useful results when a combined analysis would fail. Similarly, a 
visual presentation might show a parabola, which indicates a quadratic relationship 
between variables, but no relationship would be found if a linear correlation test were 
applied. Devotees of statistical methods often find that presenting their results visually 
helps to explain them and suggests further statistical tests. 

The process of combining statistical methods with visualization tools will take some 
time because of the conflicting philosophies of the promoters. The famed statistician 
John Tukey (1915-2000) quickly recognized the power of combined approaches [14]: 
"As yet I know of no person or group that is taking nearly adequate advantage of the 
graphical potentialities of the computer... In exploration they are going to be the data 
analyst’s greatest single resource." The combined strength of visual data mining would 
enrich both approaches and enable more successful solutions [17]. However, most books 
on data mining have only brief discussion of information visualization and vice versa. 
Some researchers have begun to implement interactive visual approaches to data mining 
[10,2,15]. 

Accelerating the process of combining hypothesis testing with exploratory data anal- 
ysis will also bring substantial benefits. New statistical tests and metrics for uniformity 
of distributions, outlier-ness, or cluster-ness will be helpful, especially if visual inter- 
faces enable users to examine the distributions rapidly, change some parameters and get 
fresh metrics and corresponding visualizations. 

6 Case Studies of Combining Visualization with Data Mining 

One way to combine visual techniques with automated data mining is to provide support 
tools for users with both components. Users can then explore data with direct manipula- 
tion user interfaces that control information visualization components and apply statis- 
tical tests when something interesting appears. Alternatively, they can use data mining 
as a first pass and then examine the results visually. Direct manipulation strategies with 
user-controlled visualizations start with visual presentation of the world of action, which 
includes the objects of interest and the actions. Early examples included air traffic con- 
trol and video games. In graphical user interfaces, direct manipulation means dragging 
files to folders or to the trashcan for deletion. Rapid incremental and reversible actions 
encourage exploration and provide continuous feedback so users can see what they are 
doing. Good examples are moving or resizing a window. Modern applications of direct 
manipulation principles have led to information visualization tools that show hundreds 
of thousands of items on the screen at once. Sliders, check boxes, and radio buttons 
allow users to filter items dynamically with updates in less than 100 milliseconds. 




[Screenshot text in Fig. 2:] The yellow dots above are homes in the DC area for sale. 
You may get more information on a home by selecting it. 
You may drag the 'A' and 'B' distance markers to your 
office or any other location you want to live near. 
Select distances, bedrooms, and cost ranges by 
dragging the corresponding slider boxes on the right. 
Select specific home types and services by pressing 
the labeled buttons on the right. 

Fig. 2. Dynamic Queries HomeFinder with sliders to control the display of markers indicating 
homes for sale. Users can specify distances to markers, bedrooms, cost, type of house and features 
[18] 



Early information visualizations included the Dynamic Queries HomeFinder (Fig. 2), 
which allowed users to select from a database of 1100 homes using sliders on home 
price, number of bedrooms, and distance from markers, plus buttons for other features 
such as fireplaces, central air conditioning, etc. [18]. 

This led to the FilmFinder [1] and then the successful commercial product, Spotfire 
(Fig. 1). One Spotfire feature is the View Tip that uses statistical data mining methods to 
suggest interesting pair-wise relationships by using linear correlation coefficients (Fig. 3). 
The View Tip might be improved by giving more user control over the specification 
of interesting-ness that ranks the outcomes. 
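
The idea of ranking pair-wise relationships under a user-specified notion of 
interesting-ness can be sketched as follows. This is not Spotfire's implementation: the 
function names, the score parameter, and the toy table are hypothetical, chosen only to 
show how the ranking criterion could be handed to the user. 

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_pairs(table, score=abs):
    """Rank every pair of columns by score(r), highest score first.

    `table` maps column names to equal-length lists of numbers.
    `score` is the user's notion of interesting-ness applied to r
    (hypothetical parameter; the default favors strong linear correlation).
    """
    ranked = []
    for a, b in combinations(sorted(table), 2):
        r = pearson_r(table[a], table[b])
        ranked.append((score(r), a, b))
    ranked.sort(reverse=True)
    return ranked


# Made-up toy data, for illustration only.
table = {
    "at_bats": [100, 200, 300, 400, 500],
    "hits": [28, 55, 80, 110, 140],
    "errors": [7, 2, 9, 3, 6],
}

strongest_first = rank_pairs(table)
weakest_first = rank_pairs(table, score=lambda r: -abs(r))
for value, a, b in strongest_first:
    print(f"{a} vs {b}: score {value:.3f}")
```

Passing a different score function changes the ranking without touching the rest of the 
code, which is the kind of control the paragraph above asks for. 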

While some users may be interested in high linear correlation coefficients, others 
may be interested in low correlation coefficients, or might prefer rankings by quadratic, 
exponential, sinusoidal or other correlations. Other choices might be to rank distributions 
by existing metrics such as skewness (negative or positive) or outlierness [3]. New 
metrics for degree of uniformity, cluster-ness, or gap-ness are excellent candidates for 
research. We are in the process of building a control panel that allows users to specify 
the distributions they are seeking by adjusting sliders and seeing how the rankings shift. 
Five algorithms have been written for 1-dimensional data and one for 2-dimensional 
data, but more will be prepared soon (Fig. 4). 
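
Ranking 1-dimensional distributions by a chosen metric might look like the sketch below. 
It does not reproduce the five algorithms mentioned above, which the text does not 
describe. The two metrics (moment-based skewness and the share of points outside Tukey's 
fences), the function names, and the sample data are assumptions for illustration. 

```python
import statistics


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def outlier_fraction(values, k=1.5):
    """Share of points lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(v < low or v > high for v in values) / len(values)


def rank(distributions, metric):
    """Return distribution names ordered by metric value, largest first."""
    return sorted(distributions, key=lambda name: metric(distributions[name]),
                  reverse=True)


# Made-up sample distributions, for illustration only.
distributions = {
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "right_tail": [1, 1, 2, 2, 3, 3, 4, 5, 30],
    "left_tail": [-30, 5, 6, 7, 7, 8, 8, 9, 9],
}

by_skewness = rank(distributions, skewness)
print(by_skewness)
print(outlier_fraction(distributions["right_tail"]))
```

A control panel of the kind described would, in effect, let the user pick the metric 
and its parameters (here k) and watch the ranking change. 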




Fig. 3. Spotfire View Tip panel with ranking of possible 2-dimensional scatter plots in descending 
order by the strength of linear correlation. Here the strong correlation in baseball statistics is 
shown between Career At Bats and Career Hits. Notice the single outlier in the upper right corner, 
representing Pete Rose’s long successful career. 



A second case study is our work with time-series pattern finding [4]. Current tools 
for stock market or genomic expression data from DNA microarrays rely on clustering 
in multidimensional space, but a more user-controlled specification tool might enable 
analysts to carefully specify what they want [9]. Our efforts to build a tool, TimeSearcher, 
have relied on query specification by drawing boxes to indicate what ranges of values 
are desired for each time period (Fig. 5). It has more of the spirit of hypothesis testing. 
While this takes somewhat greater effort, it gives users greater control over the query 
results. Users can move the boxes around in a direct manipulation style and immediately 
see the new set of results. The opportunity for rapid exploration is dramatic and users 
can immediately see where matches are frequent and where they are rare. 
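
The box-based query can be illustrated with a small sketch. This is not TimeSearcher's 
code: the Timebox class, the function names, and the price series are hypothetical. It 
only shows the matching rule described above, in which a series is returned if its 
values stay inside every box over that box's time period. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timebox:
    start: int    # first time index covered by the box (inclusive)
    end: int      # last time index covered by the box (inclusive)
    low: float    # lowest accepted value
    high: float   # highest accepted value


def matches(series, boxes):
    """True if the series stays within every box over that box's time span."""
    return all(
        box.low <= value <= box.high
        for box in boxes
        for value in series[box.start:box.end + 1]
    )


def query(dataset, boxes):
    """Names of all series in the dataset that satisfy every box."""
    return sorted(name for name, series in dataset.items()
                  if matches(series, boxes))


# Made-up price series, for illustration only.
dataset = {
    "AAA": [10, 12, 15, 14, 9, 8],
    "BBB": [10, 11, 11, 12, 13, 15],
    "CCC": [20, 18, 15, 13, 12, 11],
}

boxes = [Timebox(0, 1, 9, 13), Timebox(4, 5, 12, 16)]
first_result = query(dataset, boxes)

# "Dragging" the second box downward is just a new query with changed bounds.
moved = [Timebox(0, 1, 9, 13), Timebox(4, 5, 7, 10)]
second_result = query(dataset, moved)
print(first_result, second_result)
```

Because each query is a cheap filter over the data, re-running it on every box movement 
is what would make the immediate feedback described above possible. 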
Fig. 4. Prototype panel to enable user specification of 1-dimensional distribution requirements. 
The user has chosen the Cluster Finder II in the Algorithm box at the top. The user has specified 
the cluster tightness desired in the middle section. The ranking of the Results at the bottom lists all 
distributions according to the number of identifiable clusters. The M93-007 data is the second one 
in the Results list and it has four identifiable clusters. (Implemented by Kartik Parija and Jaime 
Spacco). 
Fig. 5. TimeSearcher allows users to specify ranges for time-series data and immediately see the 
result set. In this case two timeboxes have been drawn and 5 of the 225 stocks match this pattern 
[9]. 
7 Conclusion and Recommendations 

Computational tools for discovery, such as data mining and information visualization, 
have advanced dramatically in recent years. Unfortunately, these tools have been de- 
veloped by largely separate communities with different philosophies. Data mining and 
machine learning researchers tend to believe in the power of their statistical methods to 
identify interesting patterns without human intervention. Information visualization re- 
searchers tend to believe in the importance of user control by domain experts to produce 
useful visual presentations that provide unanticipated insights. 

Recommendation 1: integrate data mining and information visualization to invent dis- 
covery tools. By adding visualization to data mining (such as presenting scattergrams to 
accompany induced rules), users will develop a deeper understanding of their data. By 
adding data mining to visualization (such as the Spotfire View Tip), users will be able 
to specify what they seek. Both communities of researchers emphasize exploratory data 
analysis over hypothesis testing. A middle ground of enabling users to structure their 
exploratory data analysis by applying their domain knowledge (such as limiting data 
mining algorithms to specific range values) may also be a source of innovative tools. 

Recommendation 2: allow users to specify what they are seeking and what they find 
interesting. By allowing data mining and information visualization users to constrain 
and direct their tools, they may produce more rapid innovation. As in the Spotfire View 
Tip example, users could be given a control panel to indicate what kind of correlations 
or outliers they are looking for. As users test their hypotheses against the data, they find 
dead ends and discover new possibilities. Since discovery is a process, not a point event, 
keeping a history of user actions has a high payoff. Users should be able to save their 
state (data items and control panel settings), back up to previous states, and send their 
history to others. 
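
One possible shape for such a history is sketched below. The class name, its methods, 
and the example settings are hypothetical and not taken from any of the tools discussed. 
The sketch covers the three operations named above: saving a state, backing up to a 
previous state, and exporting the history so it can be sent to someone else. 

```python
import copy
import json


class ExplorationHistory:
    """Stack of saved exploration states (hypothetical helper)."""

    def __init__(self, initial_state):
        self._states = [copy.deepcopy(initial_state)]

    @property
    def current(self):
        """A copy of the most recent state."""
        return copy.deepcopy(self._states[-1])

    def save(self, state):
        """Record a new state, e.g. after the user moves a slider."""
        self._states.append(copy.deepcopy(state))

    def back(self):
        """Drop the latest state (the initial one is kept) and return the new current state."""
        if len(self._states) > 1:
            self._states.pop()
        return self.current

    def export(self):
        """Serialize the whole history as JSON text for sharing."""
        return json.dumps(self._states)


# Made-up control panel settings, for illustration only.
initial = {"price_max": 500000, "bedrooms_min": 1}
history = ExplorationHistory(initial)
history.save({"price_max": 300000, "bedrooms_min": 1})
history.save({"price_max": 300000, "bedrooms_min": 3})

shared = history.export()    # full history, taken before backing up
previous = history.back()    # undo the last change
print(previous)
```

States are deep-copied on the way in and out so that later edits to a settings 
dictionary cannot silently rewrite the recorded history. 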

Recommendation 3: recognize that users are situated in a social context. Researchers and 
practitioners rarely work alone. They need to gather data from multiple sources, consult 
with domain experts, pass on partial results to others, and then present their findings to 
colleagues and decision makers. Successful tools enable users to exchange data, ask for 
consultations from peers and mentors, and report results to others conveniently. 

Recommendation 4: respect human responsibility when designing discovery tools. If 
tools are comprehensible, predictable and controllable, then users can develop mastery 
over their tools and experience satisfaction in accomplishing their work. They want to 
be able to take pride in their successes and they should be responsible for their failures. 
When tools become too complex or unpredictable, users will avoid their use because 
the tools are out of their control. Users often perform better when they understand and 
control what the computer does [11]. 

If complex statistical algorithms or visual presentations are not well understood by 
users, they cannot act on the results with confidence. I believe that visibility of the statistical 
processes and outcomes minimizes the danger of misinterpretation and incorrect 
results. Comprehension of the algorithms behind the visualizations and the implications 
of layout encourages effective usage that leads to successful discovery. 
Abstract. In this paper we claim that meaningful representations can 
be learned by programs, although today they are almost always designed 
by skilled engineers. We discuss several kinds of meaning that repre- 
sentations might have, and focus on a functional notion of meaning as 
appropriate for programs to learn. Specifically, a representation is mean- 
ingful if it incorporates an indicator of external conditions and if the 
indicator relation informs action. We survey methods for inducing kinds 
of representations we call structural abstractions. Prototypes of sensory 
time series are one kind of structural abstraction, and though they are 
not denoting or compositional, they do support planning. Deictic rep- 
resentations of objects and prototype representations of words enable a 
program to learn the denotational meanings of words. Finally, we discuss 
two algorithms designed to find the macroscopic structure of episodes in 
a domain-independent way. 
Abstract. We present the concept of a functional programming lan- 
guage called VML (View Modeling Language), providing facilities to in- 
crease the efficiency of the iterative, trial-and-error cycle which frequently 
appears in any knowledge discovery process. In VML, functions can be 
specified so that returned values implicitly “remember”, with a special 
internal representation, that they were calculated from the corresponding 
function. VML also provides facilities for “matching” the remembered 
representation so that one can easily obtain, from a given value, the 
functions and/or parameters used to create the value. Further, we de- 
scribe, as VML programs, successful knowledge discovery tasks which we 
have actually experienced in the biological domain, and argue that com- 
putational knowledge discovery experiments can be efficiently developed 
and conducted using this language. 



1 Introduction 

The general flow and components which comprise the knowledge discovery pro- 
cess have come to be recognized [4,10] in the literature. According to these 
articles, the KDD process can be, in general, divided into several stages such 
as: data preparation (selection, preprocessing, transformation), data mining, hypothesis 
interpretation/evaluation, and knowledge consolidation. It is also well 
known that a typical process will not only go one-way through the steps, but 
will involve many feedback loops, due to the trial-and-error nature of knowledge 
discovery [2]. 

Most research in the literature concerning KDD focuses on only a single stage
of the process, such as the development of efficient and intelligent algorithms for 
a specific problem in the data mining stage. On the other hand, it seems that 
there has been comparatively little work which considers the process as a whole, 
concentrating on the iterative nature inherent in any KDD process. 



K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 30-44, 2001.
© Springer-Verlag Berlin Heidelberg 2001

VML: A View Modeling Language for Computational Knowledge Discovery



More recently, the concept of view has been introduced for describing the 
steps of this process in a uniform manner [1,12,13,14]. Views are essentially func- 
tions over data. These functions, as well as their combinations, represent ways of 
looking at data, and the values they return are attribute values concerning their
input arguments. The relationship between data, view, and the result obtained 
by applying a view to the data, can be considered as knowledge. The goal of 
KDD can be restated as the search for meaningful views. Views also provide an 
elegant interface for human intervention into the discovery process [1,12], whose 
need has been stressed in [9]. The iterative cycle of KDD consists largely of
composing and decomposing views, and facilities should be provided to assist
these activities. 

The purpose of this paper is to present the concept of a programming lan- 
guage, VML (View Modeling Language), which can help speed up this iterative 
cycle. We consider extending the Objective Caml (OCaml) language [27], a func- 
tional language which is a dialect of the ML [16] language. We chose a functional 
language for our base, since it can handle higher order values (functions) just like 
any other value, which should help in the manipulation of views. Also, functional 
languages have a reputation for enabling efficient and accurate programming of 
maintainable code, even for complex applications [6]. 

We focus on the fact that the primary difference between a view and a
function is that views must always have an interpretable meaning, because the
knowledge must be interpretable to be of any use. The two extensions we
consider are the keywords ‘view’ and ‘vmatch’. ‘view’ binds a function to a
name and instructs the program to remember any value resulting from the
function. ‘vmatch’ decomposes a functional application, enabling the
extraction of the origins of remembered values.

Of course, it is not impossible to accomplish the “remembering” with
conventional languages. For example, we can have each function return a data
structure which contains the resulting value and its representation. However,
we wish to free the programmer from the labor of keeping track of this data
structure (what parameters were used where and when) by packaging this
information implicitly into the language. As a result, the following tasks,
for example, can be done without much extra effort:

- Interpret knowledge (functions and their parameters) obtained from learn- 
ing/discovery programs. 

- Reuse knowledge obtained from previous learning/discovery rounds. 

Although we do not yet have a direct implementation of VML, we have been
conducting computational experiments written in the C++ language based on
the idea of views, obtaining substantial results [1,25]. We show how such
experiments can be conducted comparatively easily by describing the
experiments in terms of VML.

The structure of this paper is as follows: Related work is reviewed in
Section 2. Basic concepts of views and VML are described in Section 3. We
describe, using VML, two actual discovery tasks we have conducted in
Section 4. We discuss various issues in Section 5.



2 Related Work

There have been several knowledge discovery systems which focus on similar 
problems concerning the KDD process as a whole. KEPLER [21] concentrates 
on the extensibility of the system, adopting a “plug-in architecture”. CLEMEN- 
TINE [8] is a successful commercial application which focuses on human inter- 
vention, providing components which can be easily put together in many ways 
through a GUI. Our work differs in that it tries to give a solution at a
more generic level: until we understand the nature of the data, we must
try, literally, any method we can come up with, and therefore universality
is desired in our approach.

Concerning the “remembering” of the origin of a value, one way to accomplish 
this is to remember the source code of the function. For example, some dialects of 
LISP provide a function called get-lambda-expression, which returns the actual
Lisp code of a given closure. However, this can return too much information
concerning the value (e.g. the source code of a complicated algorithm). The 
idea in our work is to limit the information that the user will see, by regarding 
functions specified by the view keyword as the smallest unit of representation. 



3 Concepts 

In this section, we first briefly describe the concept of views, as found in [1]. 
Then, we discuss the basic concepts of VML, as an extension to the OCaml 
language [27], and give simple examples. 

3.1 Entity, Views, and View Operation 

Here, we review the definitions of entity, view, and view operation, and show how 
the KDD process can be described in terms of these concepts. An entity set E is 
a set of objects which may be distinguished from one another, representing the 
data under consideration. Each object $e \in E$ is called an entity. A view $v : E \to R$
is a function over E. v will take an entity e, and return some aspect (i.e. attribute
value) concerning e. A view operation is an operation which generates new views 
from existing views and entities. Below are some examples: 

Example 1. Given a view $v : E \to R$, a new view $v' : E \to R'$ may be created
with a function $\psi : R \to R'$ (i.e. $v' = \psi \circ v : E \to R \to R'$).

We can also consider n-ary functions as views. All arguments except for the 
argument expecting the entity can be regarded as parameters of the view. 
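
As a rough illustration of Example 1 and of n-ary functions as views, the following plain OCaml sketch (not VML) composes a view with a function and fixes a parameter by partial application. The entity type and the names length_view, psi, compose, and count_char are hypothetical and chosen only for this illustration.

```ocaml
(* An entity is modelled here as a DNA string; this choice is illustrative only. *)
type entity = string

(* A view v : E -> R. Here R = int (the length of the entity). *)
let length_view (e : entity) : int = String.length e

(* A function psi : R -> R'. Here R' = bool. *)
let psi (r : int) : bool = r > 5

(* View operation of Example 1: v' = psi o v. *)
let compose psi v = fun e -> psi (v e)

let long_view : entity -> bool = compose psi length_view

(* An n-ary function as a view: every argument except the entity is a parameter. *)
let count_char (c : char) (e : entity) : int =
  let n = ref 0 in
  String.iter (fun x -> if x = c then incr n) e;
  !n

(* Fixing the parameter by partial application yields a unary view. *)
let count_a : entity -> int = count_char 'a'

let () =
  Printf.printf "%b %d\n" (long_view "ttgaca") (count_a "tataat")
```

Because views are ordinary functions here, a view operation such as compose is itself just a higher order function, which is the property that motivates the choice of a functional base language.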

Hypothesis generation via machine learning algorithms can also be considered 
as a form of view operation. The generated hypothesis can also be considered a 
view. 

Example 2. Given a set of data records (entities) and their attributes (views), the
ID3 algorithm [18] (view operator) generates a decision tree T. T is also a view
because it is a function which returns the class into which a given entity is
classified. The generated view T can also be used as an input to other view
operations, to create new views, which can be regarded as knowledge consolidation.

Views and view operators are combined to create new views. The structure of
such a combination in a compound view is called the design of the view. The
task of KDD lies in the search for good views which explain the data. Knowledge 
concerning the data is encapsulated in its design. Human intervention can be 
conducted through the hand-crafted design of views by domain experts. To suc- 
cessfully assist the expert in the knowledge discovery process, the expert should 
be able to manipulate and understand the view design with ease. 



3.2 Representations 

Here, we describe the basic concepts in VML. We call the way in which a
certain value is created its representation. For example, if an integer
value 55 was created by adding the numbers from 1 to 10, the representation
of 55 is, informally, “add the integers from 1 to 10”. A value may have
multiple representations, but every representation should have only one
corresponding value (except if there is some sort of random process in the
representation). Intuitively, the representation for any value can be
considered as the source code for computing that value. However, in VML, the
representation is limited to primitive values (first order values) and
applications of functions specified with the view keyword, so that it is
feasible for the users to understand and interpret the values, seeing only
the information that they want to see.

The purpose of the view keyword is to specify that the runtime system should 
remember the representation of the return value of the function. We shall call 
such specified functions, view functions. Representations of values can be defined 
as: 

rep ::= primv                (* primitive values *)
      | vf rep1 ... repn     (* application to view functions *)
      | x . rep'             (* λ-abstraction of representations *)

vmatch is used to extract components from the representation of a value, by 
conducting pattern matching against the representation. 
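
The grammar above can be mirrored in plain OCaml as an algebraic data type. The sketch below is only an emulation under our own assumptions, not the VML implementation: the type and function names (rep, traced, show, sumn_view, argument_of_sumn) are hypothetical, the representation is attached by hand rather than by a view keyword, and ordinary match stands in for vmatch. The view function name Sumn is the one used in the next subsection.

```ocaml
(* Representations, following the grammar above. *)
type rep =
  | Prim of string              (* primitive value, stored as printed text *)
  | App of string * rep list    (* application of a named view function *)
  | Lam of string * rep         (* abstraction over one variable *)

(* A value paired with the representation that records its origin. *)
type 'a traced = { value : 'a; rep : rep }

let rec show = function
  | Prim s -> s
  | App (f, args) -> "(" ^ String.concat " " (f :: List.map show args) ^ ")"
  | Lam (x, body) -> "(" ^ x ^ " . " ^ show body ^ ")"

(* Ordinary function: sum of the integers from 1 to n. *)
let rec sumn n = if n <= 0 then 0 else n + sumn (n - 1)

(* Hand-written stand-in for what the view keyword would do automatically. *)
let sumn_view (n : int) : int traced =
  { value = sumn n; rep = App ("Sumn", [ Prim (string_of_int n) ]) }

(* Stand-in for vmatch: ordinary pattern matching on the stored representation. *)
let argument_of_sumn (v : int traced) : string option =
  match v.rep with
  | App ("Sumn", [ Prim x ]) -> Some x
  | _ -> None

let () =
  let v = sumn_view 10 in
  Printf.printf "%d::%s\n" v.value (show v.rep)
```

The bookkeeping in sumn_view is exactly the labor that VML aims to remove: with the view keyword, the pairing of value and representation would be maintained by the runtime system.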



3.3 Simple Example 

We give a simple example to illustrate basic OCaml syntax and the use of view 
and vmatch statements. The syntax and semantics of VML are the same as 
OCaml except for the added keywords. Only descriptions for the extended key- 
words are given, and the reader is requested to consult the Objective Caml 
Manual [27] for more information.

For the example in the previous subsection, a function which calculates the
sum of positive integers 1 to n can be written in OCaml as:¹

# let rec sumn n = if n <= 0 then 0 else (n + sumn (n-1));;
val sumn : int -> int = <fun>

# sumn 10;; (* apply 10 to sumn *)
- : int = 55

let binds the function (value) to the name sumn. rec specifies that the function 
is a recursive function (a function that calls itself). n is an argument of the
function sumn. int -> int is the type of the function sumn, which reads as 
follows: “sumn is a function that takes a value of type int as an argument, 
and returns a value of type int” . Notice that the type of sumn is automatically 
inferred by the compiler/interpreter, and need not be specified. Arguments can 
be applied to functions just by writing them consecutively. 

The syntax of the view keyword is the same as that of the let statement.
Suppose we specify the above function with the view keyword in place of let
(we capitalize the first letter of view functions for convenience):

# view rec Sumn n = if n <= 0 then 0 else (n + Sumn (n-1));;
val Sumn : int -> int = <fun>::(n . Sumn n)

# Sumn 10;;
- : int = 55::(Sumn 10)

Sumn is defined as a view function, and therefore, values calculated from Sumn 
are implicitly remembered. In the above example, the return value is 55, and 
its representation, shown to the right of the double colon ‘::’, is (Sumn 10).
We do not need to see the inside of Sumn, if we know the meaning of Sumn, to 
understand this value of 55. 

The vmatch keyword is used to decompose a representation of a value and 
extract the function and/or any parameters which were used to create the value.
Its syntax is the same as the match statement of OCaml, which is used for the 
pattern matching of miscellaneous data structures. 

# let v = Sumn 10;; (* apply 10 to Sumn and bind the value to v *)
val v : int = 55::(Sumn 10)

# vmatch v with (* Extract parameters used to calculate v *)
    (Sumn x) -> printf "%d was applied to Sumn\n" x
  | _ -> printf "Error: v did not match (Sumn x)\n";;
10 was applied to Sumn
- : unit = ()

In the above example, the representation of v, which is (Sumn 10), is matched
against the pattern (Sumn x). If the match is successful, ‘x’ in the pattern is
assigned the corresponding value 10. This value can be used in the expression
to the right of ‘->’, which is evaluated in case of a match. Multiple pattern
matches can be attempted: each pattern and its corresponding expression are
separated by ‘|’, and the expression for the first matching pattern is evaluated.
The underscore ‘_’ represents a wild card pattern, matching any representation.
The entire vmatch expression evaluates to the value () of type unit (similar to
void in the C language) in this case, because printf is a function that executes
a side-effect (printing a string) and returns ().

¹ The expressions starting after ‘#’ and ending with ‘;;’ are the input by the
user; the others are responses from the compiler/interpreter. Comments are
written between ‘(*’ and ‘*)’.



3.4 Partial Application 

Here, we consider how representations of partial applications of view functions
can be handled. We note, however, that our description here may contain subtle
problems, for example, concerning the order of evaluation of the expressions,
which may be counterintuitive when programs contain side-effects. A formal
description and sample implementation resolving these issues can be found in [20].

In the previous examples, we added integers from 1 to n. Suppose we want 
to specify where to start also: add the integers from m to n. We can write the 
view function as follows: 

# view rec Sum_m_to_n m n = if (m > n) then 0
                            else (n + (Sum_m_to_n m (n-1)));;
val Sum_m_to_n : int -> int -> int = <fun>::(m n . Sum_m_to_n m n)

# let sum3to = Sum_m_to_n 3;; (* partial application *)
val sum3to : int -> int = <fun>::(n . Sum_m_to_n 3 n)

# sum3to 5;;
- : int = 12::(Sum_m_to_n 3 5)

Sum_m_to_n is a view function of type int->int->int, which can be read as “a 
function that takes two arguments of type int and returns a value of type int” , 
or, “a function that takes one argument of type int and returns a value of type 
int->int”. In defining sum3to, Sum_m_to_n is applied with only one argument,
3, resulting in a function of type int->int. Applying another argument 5 to 
sum3to will result in the same value as Sum_m_to_n 3 5. 

Partially applied values are matched as follows. Arguments not applied will
only match the underscore ‘_’:

# vmatch sum3to with
    (Sum_m_to_n x _)
      -> printf "Sum_m_to_n partially applied with %d\n" x
  | _ -> printf "failed match\n";;
Sum_m_to_n partially applied with 3
- : unit = ()

We can reverse the order of arguments by using the fun keyword, which is
essentially lambda abstraction.

# let sum10from = fun m -> Sum_m_to_n m 10;;
val sum10from : int -> int = <fun>::(m . Sum_m_to_n m 10)

# sum10from 5;;
- : int = 45::(Sum_m_to_n 5 10)

# vmatch sum10from with
    (Sum_m_to_n _ x)
      -> printf "Sum_m_to_n partially applied with %d\n" x
  | _ -> printf "failed match\n";;
Sum_m_to_n partially applied with 10
- : unit = ()

The representation for sum10from is the result obtained by β-reduction of the
representation:

(m . ((m n . Sum_m_to_n m n) m 10))
→ (m . ((n . Sum_m_to_n m n) 10))
→ (m . (Sum_m_to_n m 10))
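
The reduction steps above can be reproduced on a small term type. This is a sketch under stated assumptions and not the mechanism of [20]: the constructors and the functions subst and apply are hypothetical, and variable capture is deliberately not handled, which suffices for the two cases shown.

```ocaml
type rep =
  | Prim of string               (* primitive value, as text *)
  | Var of string                (* variable bound by an abstraction *)
  | App of string * rep list     (* application of a named view function *)
  | Lam of string * rep          (* abstraction over one variable *)

(* Replace free occurrences of x by r. Capture is not handled: the sketch
   assumes r never contains a variable that is bound inside the term. *)
let rec subst x r t =
  match t with
  | Prim _ -> t
  | Var y -> if y = x then r else t
  | App (f, args) -> App (f, List.map (subst x r) args)
  | Lam (y, body) -> if y = x then t else Lam (y, subst x r body)

(* One beta-reduction step per supplied argument; arguments left over once
   the term is no longer an abstraction are ignored in this sketch. *)
let rec apply f args =
  match f, args with
  | Lam (x, body), a :: rest -> apply (subst x a body) rest
  | _, _ -> f

(* (m n . Sum_m_to_n m n) written as two nested one-variable abstractions. *)
let sum_m_to_n = Lam ("m", Lam ("n", App ("Sum_m_to_n", [ Var "m"; Var "n" ])))

(* fun m -> Sum_m_to_n m 10 *)
let sum10from = Lam ("m", apply sum_m_to_n [ Var "m"; Prim "10" ])

(* Sum_m_to_n 3 *)
let sum3to = apply sum_m_to_n [ Prim "3" ]
```

Here sum10from reduces to the term for (m . (Sum_m_to_n m 10)) and sum3to to the term for (n . Sum_m_to_n 3 n), matching the representations printed in the transcripts above.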



3.5 Multiple Representations 

In the example with Sumn, although Sumn recursively calls itself, the represen- 
tation of the values generated in the recursive calls is not remembered, because 
the function ‘+’ is not a view function. If multiple representations are to be
remembered, they can be maintained with a list of representations, and vmatch 
will try to match any of the representations. 



4 Actual Knowledge Discovery Tasks 

We describe two computational knowledge discovery experiments, showing how 
VML can assist the programmer in such experiments. As noted in Section 1, 
VML is not yet fully implemented, and therefore the experiments conducted 
here were developed with the C++ language, using the HypothesisCreator
library [25], based on the concept of views. 



4.1 Detecting Gene Regulatory Sites 

It is known that, for many genes, whether or not the gene expresses its function
depends on specific proteins, called transcription factors, which bind to specific
locations on the DNA, called gene regulatory sites. Gene regulatory sites are
usually located in the upstream region of the coding sequence of the gene. Since
proteins selectively bind to these sites, it is believed that common motifs exist
for genes which are regulated by the same protein. We consider the case where
the 2-block motif model is preferred, that is, when the binding site cannot be
characterized by a single motif, and two motifs should be searched for.

view ListDistAnd min max l1 l2: int->int->(int list)->(int list)->bool
    Return true if there exist e1 ∈ l1, e2 ∈ l2
    such that min < (e2 - e1) < max.

view AstrstrList mm pat str: astr_mismatch->string->string->(int list)
    Return the match positions (using approximate pattern matching)
    of a pattern as a list of int.

    The type astr_mismatch is the tuple (int * bool * bool * bool) where
    the int value is the maximum number of errors allowed, and the bool
    values are flags to permit the error types: insertion, deletion,
    and substitution, respectively.

Fig. 1. View functions used in the view design for detecting putative gene regulatory
sites.
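
To make the two view functions of Figure 1 concrete, here is a minimal plain OCaml sketch. It is not the HypothesisCreator code: the names astrstr_list, list_dist_and, and mismatches are hypothetical, the approximate matching is simplified to substitutions only (the insertion and deletion flags are ignored), positions are 0-based by assumption, and the strict inequalities are used as printed in Figure 1.

```ocaml
(* Mismatch settings: (max errors, allow insertion, allow deletion, allow substitution). *)
type astr_mismatch = int * bool * bool * bool

(* Number of positions at which pat differs from the window of str starting at i. *)
let mismatches pat str i =
  let d = ref 0 in
  String.iteri (fun j c -> if str.[i + j] <> c then incr d) pat;
  !d

(* Simplified AstrstrList: substitutions only; the insertion and deletion
   flags are ignored in this sketch. Returns 0-based match positions. *)
let astrstr_list ((k, _ins, _del, subst) : astr_mismatch) pat str =
  let limit = if subst then k else 0 in
  let last = String.length str - String.length pat in
  let rec go i acc =
    if i > last then List.rev acc
    else if mismatches pat str i <= limit then go (i + 1) (i :: acc)
    else go (i + 1) acc
  in
  go 0 []

(* ListDistAnd: true if some e1 in l1 and e2 in l2 satisfy min < e2 - e1 < max. *)
let list_dist_and min max l1 l2 =
  List.exists
    (fun e1 -> List.exists (fun e2 -> min < e2 - e1 && e2 - e1 < max) l2)
    l1

let () =
  (* Toy sequence: two blocks separated by 17 filler characters. *)
  let str = "ttgaca" ^ String.make 17 'c' ^ "tataat" in
  let mm = (1, false, false, true) in
  let hit =
    list_dist_and 20 30 (astrstr_list mm "ttgtca" str) (astrstr_list mm "tataat" str)
  in
  Printf.printf "%b\n" hit
```

A full implementation would replace mismatches by an edit distance computation that honours all three error flags, as in the approximate pattern matching cited in Figure 3.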



We develop a simple, original method, based on views. Testing the method
on B. subtilis $\sigma^A$-dependent promoter sequences taken from [5], our method was
able to rediscover the same results, as well as other candidates for 2-block motifs.

We started by modeling the 2-block motif for regulatory sites as consisting 
of three components: the motif pattern (a string pattern, with possible mis- 
matches), the gap width of these patterns (how far apart they can be), and their 
positions (distance in base pairs from the beginning of the coding sequence). We
construct a function with the following design (the representation is omitted):

# let orig pos len g_min g_max mm1 mm2 pat1 pat2 str =
    ListDistAnd g_min g_max
      (AstrstrList mm1 pat1 (Substring pos len str))
      (AstrstrList mm2 pat2 (Substring pos len str));;
val orig : int -> int -> int -> int -> astr_mismatch ->
           astr_mismatch -> string -> string -> string -> bool = <fun>

The explanations for the view functions used are given in Figure 1 (Substring
is explained in Figure 3). The arguments except str are parameters, and when
all the parameters are applied, a function of type string->bool is generated,
returning true if a certain 2-block motif appears in a given string, and false
otherwise. To look for good parameters, we take a supervised learning approach
and randomly select genes of B. subtilis not included in the original dataset,
from the GenBank database [24], as negative data. The score of each view is
based on its accuracy as a classification function that decides whether or not
an input sequence has the motifs. We looked at several top ranking views in
order to evaluate them.

Numerous iterations with different search spaces yielded some interesting
results. Selected results are shown in Figure 2. By limiting the search space
using knowledge obtained from previous work, we were able to come up
with views v1 and v2, whose 2-block motifs were consistent with or identical
to “TTGACA” and “TATAAT” as detected in [5,11]. We also ran the
experiments with a wider range of parameters, and found a view v3 that could
perfectly discriminate the positive and negative examples. Although a biological
interpretation must follow for the result to be meaningful, we were successful in
finding a candidate for a novel result.

v1: (str . ListDistAnd 20 30
      (AstrstrList (2, false, false, true) "ttgtca" (Substring -40 35 str))
      (AstrstrList (2, false, false, true) "tataat" (Substring -40 35 str)))
    true positive  102   false negative   40   =  71.8 %
    false positive   0   true negative   142   = 100.0 %

v2: (str . ListDistAnd 20 30
      (AstrstrList (2, false, false, true) "ttgaca" (Substring -40 35 str))
      (AstrstrList (2, false, false, true) "tataat" (Substring -40 35 str)))
    true positive  100   false negative   42   =  70.4 %
    false positive   0   true negative   142   = 100.0 %

v3: (str . ListDistAnd 25 35
      (AstrstrList (3, false, false, true) "atgatc" (Substring -50 65 str))
      (AstrstrList (2, false, false, true) "gttata" (Substring -50 65 str)))
    true positive  142   false negative    0   = 100.0 %
    false positive   0   true negative   142   = 100.0 %

Fig. 2. Representations of the results of our method to find regulatory sites.

In this kind of experiment, VML can help the expert in the following way: 
Although the views are sorted by some score, it is difficult to check the validity 
of a view according to the score: i.e., a valuable view will probably have a high 
score, but a view with a high score may not be valuable. In the evaluation stage, 
there is a need for the expert to look at the many different views with adequately 
high scores, and see what kind of parameters were used to generate the view. 
This could be written easily in VML, since it only requires obtaining and
displaying the representations of high scoring functions.
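
The evaluation step described above amounts to sorting candidate views by score and printing their representations. The plain OCaml sketch below illustrates this with the counts reported in Figure 2. The record type, the function names, and the use of plain accuracy as the score are assumptions made for this illustration (the text only says the score is based on accuracy), and the representations are abbreviated to the names v1, v2, and v3.

```ocaml
(* A candidate view: its printed representation and its counts on the data. *)
type candidate = { rep : string; tp : int; fn : int; fp : int; tn : int }

(* Assumed score: plain classification accuracy. *)
let accuracy c =
  float_of_int (c.tp + c.tn) /. float_of_int (c.tp + c.fn + c.fp + c.tn)

(* The k best candidates, best first. *)
let top k cs =
  let sorted = List.sort (fun a b -> compare (accuracy b) (accuracy a)) cs in
  List.filteri (fun i _ -> i < k) sorted

(* Counts taken from Figure 2; the representations are abbreviated to names. *)
let candidates =
  [ { rep = "v1"; tp = 102; fn = 40; fp = 0; tn = 142 };
    { rep = "v2"; tp = 100; fn = 42; fp = 0; tn = 142 };
    { rep = "v3"; tp = 142; fn = 0; fp = 0; tn = 142 } ]

let () =
  List.iter
    (fun c -> Printf.printf "%.3f  %s\n" (accuracy c) c.rep)
    (top 2 candidates)
```

In VML the rep field would not have to be filled in by the programmer, because each high scoring function would already carry its representation.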

4.2 Characterization of N-Terminal Sorting Signals of Proteins 

Proteins are composed of amino acids, and can be regarded as strings consisting 
of an alphabet of 20 characters. Most proteins are first synthesized in the cytosol, 
and carried to specified locations, called localization sites. In most cases, the 
information determining the subcellular localization site is represented as a short 
amino acid sequence segment called a protein sorting signal [17]. Given an amino 
acid sequence, predicting where the protein will be carried to is an important 
and difficult problem in molecular biology. Although numerous signal sequences 
have been found, similarities between these sequences for the same localization
site are not yet fully understood. Our aim was to come up with a predictor which 
could challenge TargetP [3], the state-of-the-art neural network based predictor, 
in terms of prediction accuracy while not sacrificing the interpretability of the 
classification rule. 

Data available from the TargetP web-site [28] was used, consisting of 940
sequences containing 368 mTP (mitochondrial targeting peptides), 141 cTP
(chloroplast transit peptides), 269 SP (signal peptides), and 162 “Other”
sequences. The general approach was to discuss with an expert how to design
the views, conduct computational experiments with those view designs, present
the results to the expert as feedback, and then repeat the process.

We first considered binary classifiers, which distinguish sequences of a
certain signal. The entity set is the set of amino acid sequences. The views we
look for are of type string -> bool: for an amino acid sequence, return a
Boolean value, true if the sequence contains a certain signal, and false if it
does not. The views we designed (in chronological order) can be written in VML
as follows (the meaning of each view function is given in Figure 3):

# let h1 pat mm ind pos len str =
    Astrstr mm pat (AlphInd ind (Substring pos len str));;
val h1 : string -> astr_mismatch -> (char -> char) -> int ->
  int -> string -> bool = <fun>::(pat mm ind pos len str .
    Astrstr mm pat (AlphInd ind (Substring pos len str)))

# let h2 thr ind pos len str =
    GT (Average (AAindex ind (Substring pos len str))) thr;;
val h2 : float -> string -> int -> int -> string ->
  bool = <fun>::(thr ind pos len str .
    GT (Average (AAindex ind (Substring pos len str))) thr)

# let h3 thr aaind pos1 len1 pat mm alphind pos2 len2 str =
    And (h1 pat mm alphind pos1 len1 str)
        (h2 thr aaind pos2 len2 str);;
val h3 : float -> string -> int -> int -> string ->
  astr_mismatch -> (char -> char) -> int -> int -> string ->
  bool = <fun>::(thr aaind pos1 len1 pat mm
    alphind pos2 len2 str . And (h1 pat mm alphind pos1 len1 str)
                                (h2 thr aaind pos2 len2 str))

Notice that after applying all the arguments except for the last string, we can 
obtain functions of type string -> bool as desired. For example, using view 
function h2, we can create a view function of type string -> bool: 

# let f = h2 3.5 "BIGC670101" 5 20;;
val f : string -> bool =
  <fun>::(str . (GT (Average (AAindex "BIGC670101"
                       (Substring 5 20 str))) 3.5))

Each function is composed of view functions, so the representation of such a
function contains information about the arguments. The representation of the
above rule can be read as: “For a given amino acid sequence, first, look at the
substring of length 20, starting from position 5. Then, calculate the average
volume² of the amino acids appearing in the substring, and return true if the
value is greater than 3.5, false otherwise”.
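
The reading of view design h2 given above can be sketched in plain OCaml as follows. Several points are assumptions of this sketch rather than facts about the original library: positions are taken to be 0-based, out-of-range segments are clamped, the index is passed as a function instead of being looked up by accession id, and toy_index contains invented values that are NOT taken from the AAindex database.

```ocaml
(* Substring as described in Figure 3: a negative pos counts from the right
   end. 0-based positions and clamping are assumptions of this sketch. *)
let substring pos len str =
  let n = String.length str in
  let start = if pos < 0 then n + pos else pos in
  let start = max 0 (min n start) in
  let len = max 0 (min len (n - start)) in
  String.sub str start len

(* Toy stand-in for an amino acid index: maps a residue to a number.
   These values are invented for illustration only. *)
let toy_index (c : char) : float =
  match c with
  | 'G' -> 1.0
  | 'A' -> 2.0
  | 'L' -> 5.0
  | 'W' -> 8.0
  | _ -> 4.0

(* Convert a string to an array of floats according to an index. *)
let aaindex index str = Array.init (String.length str) (fun i -> index str.[i])

(* Average of an array; 0.0 for the empty array. *)
let average v =
  if Array.length v = 0 then 0.0
  else Array.fold_left ( +. ) 0.0 v /. float_of_int (Array.length v)

(* View design h2: is the average index value over a segment above thr? *)
let h2 thr index pos len str = average (aaindex index (substring pos len str)) > thr

let () =
  (* Applying every parameter except the sequence yields a string -> bool view. *)
  let f = h2 3.5 toy_index 0 4 in
  Printf.printf "%b %b\n" (f "LLWWGGGG") (f "GGAAWWWW")
```

Unlike the VML version, the function f produced here is an opaque closure: the parameters 3.5, 0, and 4 cannot be recovered from it, which is the gap that remembered representations are meant to close.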

The task is now to find good parameters which define a function that can
accurately distinguish the signals. For each view design, a wide range of
parameters were applied. For each combination of parameters and view design
shown above, we obtain a function of type string->bool. The programmer need
not worry about keeping track of the meaning of each function, because the
representation may be consulted using the vmatch statement when needed. We
apply this function to all the protein sequences, and calculate the score of
this function as a classifier of a certain signal. Functions with the best
scores are selected.

² “BIGC670101” is the accession id for the amino acid index ‘volume’.

view Substring pos len str : int -> int -> string -> string
    return substring [pos, pos+len-1] of str. A negative value for pos
    means to count from the right end of the string.
view AlphInd ind str : (char -> char) -> string -> string
    convert str according to alphabet indexing ind. ind is a mapping of
    char->char, called an alphabet indexing [19], and can be considered
    as a classification of the characters of a given alphabet.
view Astrstr mm pat str : astr_mismatch -> string -> string -> bool
    approximate pattern matching [22]: match pat & str with mismatch mm.
    Type ‘astr_mismatch’ is explained in Figure 1.
view AAindex ac str : string -> string -> (float array)
    convert str to an array of float according to amino acid index ac.
    ac is an accession id of an entry in the AAindex database [7]. Each
    entry in the database represents some biochemical property of amino
    acids, such as volume, hydropathy, etc., represented as a mapping of
    char -> float.
view Average v : float array -> float
    the average of the values in v
view GT x y : 'a -> 'a -> bool
    greater than
view And x y : bool -> bool -> bool
    Boolean ‘and’

Fig. 3. View functions used in the view design to distinguish protein sorting signals.

View design h1 looks for a pattern over a sequence converted by a
classification of an alphabet [19]. We hoped to find some kind of structural
similarities of the signals with this design, but we could not find
satisfactory parameters which would let h1 predict the signals accurately.
Next, we designed a new view h2 which uses the AAindex database [7], this time
looking for characteristics of the amino acid composition of a sequence
segment. This turned out to be very effective, especially for the SP set, and
was used to distinguish SP from the other signals. For the remaining signals,
we tried combining h1 and h2 into h3. This proved to be useful for
distinguishing the “Other” set (those which do not have N-terminal signals)
from mTP and cTP. We can see that the functional nature of VML enables the
easy construction of the view designs.

By combining the views and parameters thus obtained for each signal type
into a single decision list, we were able to create a rule which competes fairly
well with TargetP in terms of prediction accuracy. The scores of a 5-fold
cross-validation are shown in Table 1. The knowledge encapsulated in the view
design was consistent with widely believed (but vague) characteristics of each
signal, and the expert was surprised that such a simple rule could describe the
sorting signals with such accuracy. A system called iPSORT was built based on
these rules, and an experimental web service is provided at the iPSORT
web-site [26].

Table 1. The Prediction Accuracy of the Final Hypothesis (scores of TargetP [3] in
parentheses). The score is defined by
$\frac{tp \times tn - fp \times fn}{\sqrt{(tp+fn)(tp+fp)(tn+fp)(tn+fn)}}$,
where tp, tn, fp, and fn are the number of true positive, true negative, false
positive, and false negative, respectively (Matthews correlation coefficient
(MCC) [15]).

True          # of   Predicted category                                  Sensitivity   MCC
category      seqs   cTP          mTP          SP           Other
cTP           141    96 (120)     26 (14)      0 (2)        19 (5)       0.68 (0.85)   0.64 (0.72)
mTP           368    25 (41)      309 (300)    4 (9)        30 (18)      0.84 (0.82)   0.75 (0.77)
SP            269    6 (2)        9 (7)        244 (245)    10 (15)      0.91 (0.91)   0.92 (0.90)
Other         162    8 (10)       17 (13)      2 (2)        135 (137)    0.83 (0.85)   0.71 (0.77)
Specificity          0.71 (0.69)  0.86 (0.90)  0.98 (0.96)  0.70 (0.78)
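
As a worked check of the score defined in the caption of Table 1, the following plain OCaml sketch computes one-vs-rest counts and the Matthews correlation coefficient from the confusion matrix of our method (the numbers outside the parentheses in Table 1). The function names are hypothetical; returning 0.0 when the denominator vanishes is a convention chosen for this sketch.

```ocaml
(* Confusion matrix of our method from Table 1: rows are true categories,
   columns are predicted categories, both in the order cTP, mTP, SP, Other. *)
let confusion =
  [| [| 96; 26; 0; 19 |];
     [| 25; 309; 4; 30 |];
     [| 6; 9; 244; 10 |];
     [| 8; 17; 2; 135 |] |]

let sum_array = Array.fold_left ( + ) 0

(* One-vs-rest counts (tp, tn, fp, fn) for category k. *)
let counts m k =
  let total = Array.fold_left (fun acc row -> acc + sum_array row) 0 m in
  let tp = m.(k).(k) in
  let fn = sum_array m.(k) - tp in
  let fp = Array.fold_left (fun acc row -> acc + row.(k)) 0 m - tp in
  let tn = total - tp - fn - fp in
  (tp, tn, fp, fn)

(* Matthews correlation coefficient; 0.0 if the denominator is zero. *)
let mcc (tp, tn, fp, fn) =
  let f = float_of_int in
  let num = f (tp * tn - fp * fn) in
  let den = sqrt (f (tp + fn) *. f (tp + fp) *. f (tn + fp) *. f (tn + fn)) in
  if den = 0.0 then 0.0 else num /. den

let () =
  Array.iteri
    (fun k name -> Printf.printf "%-5s MCC = %.2f\n" name (mcc (counts confusion k)))
    [| "cTP"; "mTP"; "SP"; "Other" |]
```

For cTP, for example, the counts are tp = 96, fn = 45, fp = 39, and tn = 760, which gives a coefficient of about 0.64, in agreement with the MCC column of Table 1.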

vmatch can be useful in the following situation: After obtaining a good view 
of design h2, we may want to see if we can find a good view of design h1, but
use the same substring sequence as h2. This can be regarded as first looking for 
a segment which has a distinct amino acid composition, and then looking closer 
at this segment, to see if structural characteristics of the segment can be found. 
This function can be written as: 

# let newh f = vmatch f with
    GT (Average (AAindex _ (Substring p l _))) _ ->
      fun pat mm ind str -> h1 pat mm ind p l str;;
val newh : '_a -> string -> astr_mismatch -> (char -> char)
           -> string -> bool = <fun>

If the representation of a function h was for example: 

(str . GT (Average (AAindex ind (Substring 3 16 str))) 3.5) 

then, the representation of (newh h) would become: 

(pat mm ind str . 

(Astrstr mm pat (AlphInd ind (Substring 3 16 str))))

representing a function of design h1, but using the parameters of h of view design
h2 for Substring. Again, we need not worry about explicitly keeping track of
what values were applied to h2 to obtain h, since it is implicitly remembered and
can be extracted by the vmatch keyword. Thus, we have seen that the design
and manipulation of views can be done easily with VML, which would assist the
trial-and-error cycle of the experiments.



5 Discussion

5.1 Implementation 

In the C++ library, each view function is encapsulated in an instance of a 
class derived from the view class. The view class has a method for interpreting 
the value for an entity. Constructors for various derived classes can take other 
instances of view classes as arguments. The view class also has a method which 
returns the view classes which were used to build the instance (a facility for 
simulating vmatch, for decomposing the functions). However, after spending
much time in development, we came to feel that coding the view functions in C++
was error prone and rather tedious. Also, although the view classes encapsulate
functions, the function itself could not be easily reused for other purposes. 

For the points mentioned above, we can safely say that VML is advantageous 
over our C++ library. However, an efficient implementation of VML, which is 
beyond the scope of this paper, is a topic of interest. The implementation given 
in [20] uses the Camlp4 preprocessor (and printer) [23], which converts a VML 
program (with a different syntax from this paper) into an OCaml program, and 
it may be the case that there are optimizations that can be performed by a 
dedicated compiler. 



5.2 Conclusion 

We presented the concept of a language called VML, as an extension of the Ob- 
jective Caml language. The advantages of VML are: 1) Since VML is a functional 
language, the composition and application of views can be done in a natural way, 
compared to imperative languages. 2) By defining the unit of knowledge as views, 
the programmer does not need to explicitly keep track of how each individual 
view was designed (i.e. manage data structures to remember the set of param- 
eters). 3) The programmer can use “parts” of a good view which can only be 
determined perhaps at runtime, and apply it to another (the example in Section 
4.2). 4) In an interactive interface, (i.e. a VML interactive interpreter), the user 
can compose and decompose views and view designs, and apply them to data. 
When the user accidentally stumbles upon an interesting view, he/she can retrieve 
the design immediately. 

Using VML, we modeled and described successful knowledge discovery tasks 
which we have actually experienced, and showed that the points noted above 
can lighten the burden of the programmer, and as a result, give way to speed- 
ing up the iterative trial-and-error cycle of computational knowledge discovery 
processes. 
Abstract. The Symposium on Computational Discovery of Communicable 
Knowledge was held from March 24 to 25, 2001, at Stanford University. 
Fifteen speakers reviewed recent advances in computational approaches 
to scientific discovery, focusing on their discovery tasks and the 
generated knowledge, rather than on the discovery algorithms themselves. 
Despite considerable variety in both tasks and methods, the talks were 
unified by a concern with the discovery of knowledge cast in formalisms 
used to communicate among scientists and engineers. 



Computational research on scientific discovery has a long history within both 
artificial intelligence and cognitive science. Early efforts focused on reconstruct- 
ing episodes from the history of science, but the past decade has seen similar 
techniques produce a variety of new scientific discoveries, many of them leading 
to publications in the relevant scientific literatures. Work in this paradigm has 
emphasized formalisms used to communicate among scientists, including numeric 
equations, structural models, and reaction pathways. 

However, in recent years, research on data mining and knowledge discovery 
has produced another paradigm. Even when applied to scientific domains, this 
framework employs formalisms developed by artificial intelligence researchers 
themselves, such as decision trees, rule sets, and Bayesian networks. Although 
such methods can produce predictive models that are highly accurate, their 
outputs are not stated in terms familiar to scientists, and thus typically are not 
very communicable. 

To highlight this distinction, Pat Langley organized the Symposium on Com- 
putational Discovery of Communicable Knowledge, which took place at Stanford 
University’s Center for the Study of Language and Information on March 24 and 
25, 2001. The meeting’s aim was to bring together researchers who are pursuing 
computational approaches to the discovery of communicable knowledge and to 
review recent advances in this area. The primary focus was on discovery in sci- 
entific and engineering disciplines, where communication of knowledge is often 
a central concern. 
Each of the 15 presentations emphasized the discovery tasks (the problem 
formulation and system input, including data and background knowledge) and 
the generated knowledge (the system output). Although artificial intelligence 
and machine learning traditionally focus on differences among algorithms, the 
meeting addressed the results of computational discovery at a more abstract 
level. In particular, it explored what methods for the computational discovery 
of communicable knowledge have in common, rather than the great diversity of 
methods used to that end. 

The commonalities among methods for communicable knowledge discovery 
were summarized best by Raul Valdes-Perez in a presentation titled A Recipe for 
Designing Discovery Programs on Human Terms. The key step in his recipe was 
identifying a set of possible solutions for some discovery task, as it is here that 
one can adopt a formalism that humans already use to represent knowledge. 
Valdes-Perez viewed computational discovery as a problem-solving activity to 
which one can apply heuristic-search methods. He illustrated the recipe on the 
problem of discovering niche statements, i.e., properties of items that make them 
unique or distinctive in a given set of items. 

The knowledge representation formalisms considered in the different pre- 
sentations were diverse and ranged from equations through qualitative rules to 
reaction pathways. Most talks at the symposium fell within two broad cate- 
gories. The first was concerned with equation discovery in either static systems 
or dynamic ones that change over time. The second addressed communicable 
knowledge discovery in biomedicine and in the related fields of biochemistry and 
molecular biology. 

One formalism that scientists and engineers rely on heavily is equations. 
The task of equation discovery involves finding numeric or quantitative laws, 
expressed as one or more equations, from collections of measured numeric data. 
Most existing approaches to this problem deal with the discovery of algebraic 
equations, but recent work has also addressed the task of dynamic system iden- 
tification, which involves discovering differential equations. 

Takashi Washio from Osaka University presented a talk about Conditions on 
Law Equations as Communicable Knowledge, in which he discussed the condi- 
tions that equations must satisfy to be considered communicable. In addition to 
fitting the observed data, these include generic conditions and domain-dependent 
conditions. The former include objectiveness, generality, and reproducibility, as 
well as parsimony and mathematical admissibility with respect to unit dimen- 
sions and scale type constraints. 

Kazumi Saito from Nippon Telegraph and Telephone and Mark Schwabacher 
from NASA Ames Research Center presented two related applications of compu- 
tational equation discovery in the environmental sciences, both concerned with 
global models of the Earth ecosystem. Saito’s talk on Improving an Ecosys- 
tem Model Using Earth Science Data addressed the task of revising an existing 
quantitative scientific model for predicting the net plant production of carbon 
in the light of new observations. Schwabacher’s talk, Discovering Communicable 
Scientific Knowledge from Spatio-Temporal Data in Earth Science, dealt with 
the problem of predicting from climate variables the Normalized Difference 
Vegetation Index, a measure of greenness and a key component of the previous 
ecosystem model. 

Four presentations discussed the task of dynamic system identification, which 
involves identifying the laws that govern behavior of systems with continuous 
variables that change over time. Such laws typically take the form of differen- 
tial equations. Two of these talks described extensions to equation discovery 
methods to address system identification, whereas the other talks reported work 
that began with methods for system identification and incorporated artificial 
intelligence techniques that take advantage of domain knowledge. 

Saso Dzeroski from the Jozef Stefan Institute, in his talk on Discovering Or- 
dinary and Partial Differential Equations, gave an overview of computational 
methods for discovering both ordinary and partial differential equations, the 
second of which describe dynamic systems that involve change over several 
dimensions (e.g., space and time). Ljupco Todorovski, from the same research 
center, discussed an approach that uses domain knowledge to aid the discovery 
process in his talk, Using Background Knowledge in Differential Equations 
Discovery. He showed how knowledge in the form of context-free grammars can 
constrain discovery in the domain of population dynamics. 

Reinhard Stolle, from Xerox PARC, spoke about Communicable Models and 
System Identification. He described a discovery system that handles both 
structural identification and parameter estimation by integrating qualitative 
reasoning, numerical simulation, geometric reasoning, constraint reasoning, 
abstraction, and other mechanisms. Matthew Easley from the University of 
Colorado, Boulder, reported extensions to Stolle’s framework in his 
presentation, Incorporating Engineering Formalisms into Automated Model 
Builders. His approach 
relied on input-output modeling to plan experiments and using the resulting 
data, combined with knowledge at different levels of abstraction, to construct a 
differential equation model. 

The talk by Feng Zhao from Xerox PARC, Structure Discovery from Massive 
Spatial Data Sets, described an approach to analyzing spatio-temporal data that 
relies on the notion of spatial aggregation. This mechanism generates summary 
descriptions of the raw data, which it characterizes at varying levels of detail. 
Zhao reported applications to several challenging problems, including the inter- 
pretation of weather data, optimization for distributed control, and the analysis 
of spatio-temporal diffusion-reaction patterns. 

The rapid growth of biological databases, such as that for the human genome, 
has led to increased interest in applying computational discovery to biomedicine 
and related fields. Five presentations at the symposium focused on this general 
area. They covered a variety of discovery methods, including both propositional 
and first-order rule induction, genetic programming, theory revision, and ab- 
ductive inference, with similar breadth in the biological discovery tasks to which 
they were applied. 

Bruce Buchanan and Joseph Phillips, from the University of Pittsburgh, gave 
a presentation titled Introducing Semantics into Machine Learning. This focused 
on their incorporation of domain knowledge into rule-induction algorithms to let 
them find interesting and novel relations in medicine and science. They reviewed 
both syntactic and semantic constraints on the rule discovery process and showed 
that stronger forms of background knowledge increase the chances that discov- 
ered rules are understandable, interesting, and novel. 

Stephen Muggleton from York University, in his talk Knowledge Discovery 
in Biological and Chemical Domains, described his application of first-order rule 
induction to predicting the structure of proteins, modeling the relations between 
a chemical’s structure and its activity, and predicting a protein’s function from its 
structure (e.g., identifying precursors of neuropeptides). Knowledge discovered 
in these efforts has appeared in journals for the respective scientific areas. 

John Koza from Stanford University presented Reverse Engineering and Au- 
tomatic Synthesis of Metabolic Pathways from Observed Data. His approach uti- 
lized genetic programming to carry out search through a space of metabolic 
pathway models, with search directed by the models’ abilities to fit time-series 
data on observed chemical concentrations. The target model included an internal 
feedback loop, a bifurcation point, and an accumulation point, suggesting the 
method can handle complex metabolic processes. 

The presentation by Pat Langley, from the Institute for the Study of Learn- 
ing and Expertise, addressed Knowledge and Data in Computational Biological 
Discovery. He reported an approach that used data on gene expressions to revise 
a model of photosynthetic regulation in Cyanobacteria previously developed by 
plant biologists. The result was an improved model with altered processes that 
better explains the expression levels observed over time. The ultimate goal is an 
interactive system to support human biologists in their discovery activities. 

Marc Weeber from the U.S. National Library of Medicine reported on a quite 
different approach in his talk on Literature-based Discovery in Biomedicine. The 
main idea relies on utilizing bibliographic databases to uncover indirect but 
plausible connections between disconnected bodies of scientific knowledge. He 
illustrated this method with a successful example of finding potentially new 
therapeutic applications for an existing drug, thalidomide. 

Sakir Kocabas, from Istanbul Technical University, talked about The Role 
of Completeness in Particle Physics Discoveries, which dealt with a completely 
different domain. He described a computational model of historical discovery 
in particle physics that relies on two main criteria - consistency and complete- 
ness - to postulate new quantum properties, determine those properties’ values, 
propose new particles, and predict reactions among particles. Kocabas’ system 
successfully simulated an extended period in the history of this field, including 
discovery of the neutrino and postulation of the baryon number. 

At the close of the symposium, Lorenzo Magnani from the University of Pavia 
commented on the presentations from a philosophical viewpoint. In particular, 
he cast the various efforts in terms of his general framework for abduction, which 
incorporates different types of explanatory reasoning. The gathering also spent 
time honoring the memory of Herbert Simon and Jan Zytkow, both of whom 
played seminal roles in the field of computational scientific discovery. 
Further information on the symposium is available at the World Wide Web 
page http://www.isle.org/symposia/comdisc.html. This includes informa- 
tion about the speakers, abstracts of the presentations, and pointers to publi- 
cations related to their talks. Slides from the presentations can be found at the 
Web page http://math.nist.gov/~JDevaney/CommKnow/. Saso Dzeroski and 
Ljupco Todorovski are currently editing a book based on the talks given at the 
symposium. Information on the book will appear at the symposium page and 
the first author’s Web page as it becomes available. 
Abstract. In Data Mining applications of the frequent sets problem, 
such as finding association rules, a commonly used generalization is to 
see each transaction as the characteristic function of the corresponding 
itemset. This allows one to find also correlations between items not being 
in the transactions; but this may lead to the risk of a large and hard to 
interpret output. We propose a bottom-up algorithm in which the ex- 
ploration of facts corresponding to items not being in the transactions is 
delayed with respect to positive information of items being in the trans- 
actions. This allows the user to dose the association rules found in terms 
of the amount of correlation allowed between absences of items. The al- 
gorithm takes advantage of the relationships between the corresponding 
frequencies of such itemsets. With a slight modification, our algorithm 
can be used as well to find all frequent itemsets consisting of an arbi- 
trary number of present positive attributes and at most a predetermined 
number k of present negative attributes. 
1 Introduction 

Data Mining or Knowledge Discovery in Databases (KDD) is a field of increasing 
interest with strong connections with several research areas such as databases, 
machine learning, and statistics. It aims at finding useful information from large 
masses of data; see [5]. One of the most relevant subroutines in applications of 
this field is finding frequent itemsets within the transactions in the database. This 
task consists of finding highly frequent itemsets, by comparing their frequency 
of occurrence within the given database with a given parameter $\sigma$. This problem 
can be solved by the well-known Apriori algorithm [2]. 

The Apriori algorithm is a method of searching the lattice of itemsets with 
respect to itemset inclusion. The strategy starts from the empty set and scans 
itemsets from smaller to larger in an incremental manner. The Apriori algorithm 
uses this strategy to effectively prune away a substantial number of unproductive 
itemsets. 
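
The levelwise strategy just described can be illustrated with a minimal sketch. It is not the implementation of [2]: it starts from single items rather than the empty set, recounts supports naively, and the names `apriori`, `transactions`, and `sigma` are assumptions chosen for this illustration.

```python
from itertools import combinations


def apriori(transactions, sigma):
    """Return {itemset: support} for every itemset with support > sigma.

    transactions: list of sets of items; sigma: threshold in [0, 1].
    """
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items}
    frequent = {}
    while level:
        # Count the current candidates and keep only the frequent ones.
        current = {c: support(c) for c in level}
        current = {c: s for c, s in current.items() if s > sigma}
        frequent.update(current)
        # Join frequent sets into candidates one item larger, then prune
        # every candidate that has an infrequent subset one size smaller.
        size = len(next(iter(current))) + 1 if current else 0
        joined = {a | b for a in current for b in current if len(a | b) == size}
        level = {c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, size - 1))}
    return frequent
```

The pruning step is where the strategy pays off: a candidate is never counted if any of its immediate subsets already failed the threshold.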

The frequent sets that result from this task can be used then to discover 
association rules that have support and confidence values no smaller than the 
user-specified minimum thresholds [1], or to solve other related Knowledge Dis- 
covery problems [7]. We do not discuss here how to form association rules from 
frequent itemsets, nor any other application of these; but focus on the perfor- 
mance of that very step, finding highly frequent patterns, whose complexity 
dominates by far the computational cost of many such applications. 

Here we consider the case where each transaction of the database is a 
binary-valued function of the attributes. The difference with the itemsets view 
is that now we look for patterns where the non-occurrence of an item is important 
too. This is formalized in terms of partial functions, which, on each item, may 
include it (value 1), exclude it (value 0), or not consider it (undefined). 

It was noticed in [6] that essentially the same algorithms, with the same “a 
priori” pruning strategies, can be applied to many other settings in which one 
looks for a certain theory on a certain formal language according to a certain 
predicate that is monotone on a generalization/specialization relation. In par- 
ticular, our setting with binary-valued attributes falls into this category, and 
actually there exist implementations of the Apriori algorithm that solve the 
problem for the setting where each transaction is actually a function. Thus, they 
can be used to solve the problem of finding partial functions whose frequency is 
over some threshold. 

However, it is known that direct use of these algorithms on real life data 
frequently comes up with extremely large numbers of frequent sets consisting 
“only of zeros”; for example, in the prototypical case of market basket data, 
certainly the number of items is overwhelmingly larger than the average number 
of items bought, and this means that the output of any frequent sets algorithm 
will contain large amounts of information of the sort “most of the times that 
scotch is not bought, bourbon is not bought either, with large support”. If such 
negative information is not desired at all, the original Apriori version can be 
used; but there may be cases where limited amounts of negative information 
are deemed useful, for instance looking for alternative products that can act 
as mutual replacements, and yet one does not want to be forced into a search 
through the huge space of all partial functions. We are interested in producing 
algorithms that will provide frequent “itemsets” that have “missing” products, 
but in a controlled manner, so that they are useful when having some missing 
products in the itemsets is important but not so much as the products that are 
in the itemsets. 

Here we develop a variant of the Apriori algorithm that, if supplied with a 
limit k on the maximum number of negated attributes desired in the output 
frequent sets, will take advantage of this fact, and produce frequent itemsets for 
which this limit is obeyed. Of course, it does so in a much more efficient way 
than just applying Apriori and discarding the part of the output that does not 
fulfill this condition. First, because the exploration is organized in a way that 
naturally reflects the condition on the output. Second, because the fact that 
each item may be, or not be, in each itemset, but not both, implies 
complementarity relationships between the frequencies of “itemsets” that 
contain, or do not contain, a given item. We use these relationships to find 
out the frequencies of some “itemsets” without actually counting them, thus 
saving computational work. 



2 Preliminaries 

Now, we give the concepts that we will use throughout the paper. We consider a 
database $T = \{t_1, \dots, t_N\}$ with $N$ rows over a set 
$R = \{A_1, \dots, A_n\} = \{A_i : i \in I\}$ of binary-valued attributes, that 
can be seen as either items or columns; actually they just serve as a visual 
aid for their index set $I = \{1, \dots, n\}$. 

Each row, or transaction, maps $R$ into $\{0, 1\}$. For $A \in R$, we also write 
$A \in t_i$ for $t_i(A) = 1$ and, departing from standard use, $\bar{A} \in t_i$ 
for $t_i(A) = 0$. Obviously, $A \in t_i$ or $\bar{A} \in t_i$ but not both. The 
database is actually a multiset of transactions. Each transaction has a unique 
identifier. 

As for partial functions, they map a subset of $R$ into $\{0, 1\}$; those that are 
defined for exactly $\ell$ attributes are called $\ell$-itemsets. The goal of our 
algorithm will be to find frequent itemsets with any number of attributes mapped 
to 0 and any number of attributes mapped to 1; but in some specific order. Our 
notation for these partial functions is as follows. For $p \in \mathcal{P}(I)$ and 
$s \in \mathcal{P}(I - p)$ (so that $s \cap p = \emptyset$), we denote by $A^{p,s}$ 
the subset, and identify it with the partial function, mapping the subset 
$A^p = \{A_i : i \in p\}$ to 1, the subset $A^s = \{A_j : j \in s\}$ to 0, and 
undefined on the rest. Itemsets $A^{p,s}$ are called $k$-negative itemsets where 
$|s| = k$, $k = 0, \dots, n$. If $|s| = 0$ then we have the positive itemset 
$A^{p,\emptyset}$. 

We identify partial functions defined on a single attribute $A_j$, namely, 
$A^{\{j\},\emptyset}$ or $A^{\emptyset,\{j\}}$, with the corresponding symbol 
$A_j$ or $\bar{A}_j$ respectively. A transaction can be seen as a total function. 
An itemset can be seen as a partial function. If the partial function can be 
extended to the total function corresponding to a transaction then we say that 
an itemset is a subset of a transaction and we employ the standard symbol 
$\subseteq$ for this case. 

The support of an itemset (or partial function) is defined as follows. 
Definition 1. Let $R = \{A_1, \dots, A_n\} = \{A_i : i \in I\}$ be a set of $n$ 
items and let $T = \{t_1, \dots, t_N\}$ be a database of transactions as before. 
The support or frequency of an itemset $A$ is the ratio of the number of 
transactions on which it occurs as a subset to the total number of transactions. 
Therefore: 

$$fr(A) = \frac{|\{t \in T : A \subseteq t\}|}{N}$$

Given a user-specified minimum support value (denoted by $\sigma$), we say that 
an itemset $A$ is frequent if its support is more than the minimum support, i.e. 
$fr(A) > \sigma$. 
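
Definition 1 translates directly into code when itemsets and transactions are both stored as mappings from attributes to values. The sketch below is only an illustration; the dictionary representation and the names `occurs_in`, `fr`, and `is_frequent` are assumptions made here, not part of the paper.

```python
def occurs_in(itemset, transaction):
    """True if the partial function `itemset` extends to `transaction`.

    itemset: dict attribute -> 0 or 1 (undefined attributes are simply absent).
    transaction: dict giving 0 or 1 for every attribute.
    """
    return all(transaction[a] == v for a, v in itemset.items())


def fr(itemset, database):
    """Support: fraction of transactions in which the itemset occurs."""
    return sum(occurs_in(itemset, t) for t in database) / len(database)


def is_frequent(itemset, database, sigma):
    """An itemset is frequent when its support exceeds the threshold sigma."""
    return fr(itemset, database) > sigma
```

For instance, `fr({"A": 1, "B": 0}, database)` is the fraction of rows that contain A and do not contain B, while the empty dictionary always has support 1.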

We introduce a natural structure in the itemset space by placing them into 
“floors” and “levels”. The floor k contains the itemsets with k negative at- 
tributes. In each floor, the itemsets are organized in levels (as usual): the level is 
the number of the attributes of the itemset. Thus, in floor zero we place positive 
itemsets, ordered by itemset inclusion (or equivalently, index set inclusion); in 
the first floor we place all itemsets with one attribute valued to 0, organized 
similarly, and related similarly to the itemsets in floor zero. In floor k we place 
all the itemsets with k attributes valued to 0, organized level wise in the standard 
way, and related similarly to the itemsets in other floors. 

Thus we are considering the order relation defined as follows: 

Definition 2. For $p \in \mathcal{P}(I)$, $s \in \mathcal{P}(I - p)$, 
$q \in \mathcal{P}(I)$, and $t \in \mathcal{P}(I - q)$, given partial functions 
$X = A^{p,s}$ and $Y = A^{q,t}$, we denote by $X \preceq Y$ the fact that 
$p \subseteq q$ and $s \subseteq t$. 

With respect to this relation, the property of having frequency larger than 
any threshold is antimonotone, since $X \preceq Y$ implies $fr(X) \geq fr(Y)$. Thus, 
whenever an itemset is not frequent enough, neither is any of its extensions, 
and this fact allows one to prune away a substantial number of unproductive 
itemsets. Therefore, frequent sets algorithms can be applied rather directly to 
this case. Our purpose now is to aim at a somewhat more refined algorithm. 

Now, we give a simple example to show the structure of the itemset space. 
This example will be useful to describe the frequent itemset candidate generation 
and the path that our algorithm follows for it. 

Example: Let $R = \{A, B, C, D\}$ be the set of four items. In this case, we 
use four floors to represent the itemsets with any number of negative attributes 
and any number of positive attributes. In each rectangle, the pair $(f, \ell)$ 
indicates the floor $f$ (number of negative attributes in the itemsets of this 
rectangle) and level $\ell$ (cardinality of the itemsets of this rectangle). See 
figure 1. 
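
To make the floor and level structure concrete, the following sketch enumerates the itemsets that belong to one rectangle. The function name `rectangle` and the representation of an itemset as a pair of frozensets (positive attributes, negated attributes) are assumptions made for this illustration.

```python
from itertools import combinations


def rectangle(attributes, floor, level):
    """All itemsets with `level` attributes, `floor` of them negated.

    Each itemset is returned as a pair (positives, negatives) of frozensets.
    """
    result = []
    for chosen in combinations(attributes, level):
        for negatives in combinations(chosen, floor):
            positives = frozenset(chosen) - frozenset(negatives)
            result.append((positives, frozenset(negatives)))
    return result
```

Under this representation, `rectangle("ABCD", 1, 2)` lists the itemsets of two attributes over the four items of the example in which exactly one attribute is negated.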

3 Algorithm Bounded-neg-Apriori 

Our algorithm performs the same computations as Apriori on the zero floor, 
but then uses the frequencies computed to try to reduce the computational 
effort spent on 1-negative itemsets. This process goes on along all floors. Overall, 
Fig. 1. The structure of the itemset space 



bounded-neg-Apriori can be seen as a refinement of Apriori in which the explicit 
evaluation of the frequency of $k$-negative itemsets is avoided, since it can be 
obtained from some itemsets of the previous floor, if they are processed in the 
appropriate order. 

We use a number of very easy properties of the frequencies. Of course all of 
the frequencies are real numbers in [0, 1]. 

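Proposition 1. Let $p \in \mathcal{P}(I)$ be arbitrary, and $s \in \mathcal{P}(I - p)$ 
with $|s| \geq 1$. 

1. For each $j \in s$, $fr(A^{p,s}) = fr(A^{p,s-\{j\}}) - fr(A^{p \cup \{j\},s-\{j\}})$ and, 

$fr(A^{\emptyset,\{j\}}) = 1 - fr(A^{\{j\},\emptyset})$. 

2. $A^{p,s}$ is frequent iff $\exists j \in s$, $fr(A^{p,s-\{j\}}) > \sigma + fr(A^{p \cup \{j\},s-\{j\}})$. 

Remark 1: Each of the up to $|s|$-many ways of decomposing $fr(A^{p,s})$ in part 
1 leads to the same result: if $fr(A^{p,s-\{j\}}) < \sigma$, for any $j \in s$, then 
$A^{p,s}$ is not frequent. 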
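
The decomposition in part 1 means that the frequency of an itemset with negated attributes can be obtained from frequencies of positive itemsets alone, by peeling off one negated attribute at a time. The sketch below illustrates this; the name `derived_fr` and the dictionary `positive_fr` are assumptions, and the sketch presumes that every positive frequency it looks up is available, which the algorithm itself does not require.

```python
def derived_fr(p, s, positive_fr):
    """Frequency of the itemset with positives p and negatives s.

    p, s: disjoint frozensets of attribute indices (valued 1 and 0).
    positive_fr: dict mapping a frozenset q to the frequency of the
    positive itemset on q.
    """
    if not s:
        return positive_fr[p]
    j = next(iter(s))  # any j in s leads to the same result
    rest = s - {j}
    # Frequency with j negated = frequency without j - frequency with j positive.
    return derived_fr(p, rest, positive_fr) - derived_fr(p | {j}, rest, positive_fr)
```

Each negated attribute doubles the number of lookups in this naive form, which is one reason to process the floors in order and reuse the frequencies already obtained in the previous floor.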

We will also use the following easy properties regarding the relation of the 
threshold a to the value one-half. They allow for some extra pruning to be 
done for quite high frequency values (although this case might be infrequently 
occurring in practice). 

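Proposition 2. Let $p \in \mathcal{P}(I)$ be arbitrary, and $s \in \mathcal{P}(I - p)$, 
arbitrary for statements not depending on $p$. 

1. $|fr(A_j) - 0.5| < |\sigma - 0.5| \iff |fr(\bar{A}_j) - 0.5| < |\sigma - 0.5|$. 

2. If $\sigma < 0.5$ then $fr(A_j) < \sigma \Rightarrow fr(\bar{A}_j) > \sigma$ and 
$fr(A_j) > 1 - \sigma \Rightarrow fr(\bar{A}_j) < \sigma$. 

3. If $\sigma > 0.5$ then $fr(A_j) > \sigma \Rightarrow fr(\bar{A}_j) < \sigma$ and 
$fr(A_j) < 1 - \sigma \Rightarrow fr(\bar{A}_j) > \sigma$. 

4. $\forall j \in s$, if $\sigma > 0.5$ and $fr(A^{p,s-\{j\}}) > \sigma + fr(A^{p \cup \{j\},s-\{j\}})$ 
then $fr(A^{p \cup \{j\},s-\{j\}}) < \sigma$, that is, $A^{p \cup \{j\},s-\{j\}}$ is not frequent. 

Remark 2: If $\sigma > 0.5$ and $\exists j \in s$ such that $fr(A^{p \cup \{j\},s-\{j\}}) > \sigma$ 
then $A^{p,s}$ is not frequent. 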
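
The reasoning behind Remark 2 can be spelled out in one short chain. The derivation below is a reconstruction from the decomposition of frequencies stated in Proposition 1 and is offered as a sketch, assuming $\sigma > 0.5$ and an index $j \in s$ for which the itemset with $j$ moved to the positive side has frequency above $\sigma$.

```latex
\begin{align*}
fr(A^{p,s}) &= fr(A^{p,s-\{j\}}) - fr(A^{p\cup\{j\},s-\{j\}})
  && \text{(Proposition 1, part 1)} \\
 &< 1 - \sigma
  && \text{(since } fr(A^{p,s-\{j\}}) \leq 1 \text{ and } fr(A^{p\cup\{j\},s-\{j\}}) > \sigma\text{)} \\
 &< \sigma
  && \text{(since } \sigma > 0.5\text{)}
\end{align*}
```

So the frequency of $A^{p,s}$ falls below the threshold and the itemset need not be generated as a candidate.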
3.1 Candidate Generation 

Moving to the next round of candidates once all frequent $\ell$-itemsets have been 
identified corresponds to moving up, in all possible ways, one step within the 
same floor, and climbing up in all possible ways to the next floor. 

More formally, at the floor zero, frequent set $A^{p,\emptyset}$ leads to 
consideration as potential candidates of the following itemsets: all 
$A^{q,\emptyset}$ where $q = p \cup \{i\}$ and all $A^{p,\{j\}}$ for $j \notin p$. 
Also, itemset $A^{p,\{j\}}$ would lead to $A^{q,\{j\}}$ for $q = p \cup \{i\}$, for 
$i \notin p$ and $i \neq j$; our algorithm does not use this last sort of step. 

In the other floors the movements are in the same form. For all $p \in \mathcal{P}(I)$ 
and $s \neq \emptyset$, from $A^{p,s}$ we can climb up to the next floor to 
$A^{p,t}$ where $t = s \cup \{j\}$, for $j \in I - (s \cup p)$. Also, itemset 
$A^{p,s}$ would lead to $A^{q,s}$ for $q = p \cup \{i\}$, for $i \notin p$ and 
$i \notin s$, but we will not use such steps either. 
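
The climbing step can be sketched as follows. This is a simplified illustration, not the candidate generation of the algorithm itself: it only checks that removing any negated attribute leaves a frequent itemset of the previous floor, and omits the other conditions. The name `climb` and the pair-of-frozensets representation are assumptions.

```python
def climb(prev_floor_frequent, index_set):
    """Candidates one floor up, obtained by negating one more attribute.

    prev_floor_frequent: set of (p, s) pairs of frozensets that are frequent
    in the previous floor (p valued 1, s valued 0).
    index_set: frozenset of all attribute indices.
    """
    candidates = set()
    for p, s_prev in prev_floor_frequent:
        for m in index_set - (p | s_prev):
            s = s_prev | {m}
            # Keep the candidate only if dropping any negated attribute
            # gives an itemset already known to be frequent.
            if all((p, s - {j}) in prev_floor_frequent for j in s):
                candidates.add((p, s))
    return candidates
```

The check inside the loop is the antimonotonicity argument of Section 2 applied across floors: a candidate with an infrequent restriction cannot itself be frequent.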

Therefore the scheme of the search of frequent itemsets with $k$ 0-valued 
attributes (i.e. in the floor $k$) is based on the following: whenever enough 
frequencies in the previous floor are known to test it, if 
$fr(A^{p,s-\{j\}}) > \sigma + fr(A^{p \cup \{j\},s-\{j\}})$ where $j \in s$, then we 
know $fr(A^{p,s}) > \sigma$ so that it can be declared frequent; moreover, for 
$\sigma > 0.5$ this has to be tested only when $A^{p \cup \{j\},s-\{j\}}$ turned 
out to be nonfrequent although $A^{p,s-\{j\}}$ was frequent. 

Example: Let us turn our attention again to the example. Let us suppose 
that $\sigma < 0.5$; we explain the process of candidate generation and the path 
that our algorithm follows for it. Suppose that the maximal itemsets to be found 
are $ABC$, $AB\bar{C}$, and $A\bar{B}$. Thus, $A$, $B$, $C$ are frequent items, 
and also $\bar{B}$ and $\bar{C}$ are frequent ‘negative items’. At the 
initialization, we find that $D$, $\bar{A}$, and $\bar{D}$ cannot appear in any 
frequent itemset. The algorithm stores this information by means of the set $I$ 
(defined later). In the following step, we take into consideration as potential 
candidates, firstly the itemsets in (0,2), secondly in (1,2), and at last, in 
(2,2) that verify the conditions. There we find the frequent itemsets are 
$AB$, $AC$, $BC$, $A\bar{B}$, $A\bar{C}$, $B\bar{C}$. At this moment, we know that 
there do not exist frequent itemsets in (2,2). So, there will not exist frequent 
itemsets in $(f, \ell)$ with $f \geq 2$, $\ell \geq 2$ and $\ell \geq f$. This 
information is used in the algorithm by means of the set $J$ (defined later) to 
refine the search of candidate generation. In the following step we scan for 
frequent itemsets in (0,3) and (1,3) and $ABC$, $AB\bar{C}$ are frequent 
itemsets, and the exploration of the next level proves that, together with 
$A\bar{B}$, they are the maximal frequent itemsets. Along the example it is clear 
how the algorithm would proceed in case we are given a bound on the number 
of negative attributes present: this would just discard floors that do not obey 
that limitation. 



3.2 The Algorithm 

Now, we present the algorithm in a more precise form. The algorithm has as 
input the set of attributes, the database, and the threshold $\sigma$ on the 
support. The output of the algorithm is the set of all frequent itemsets with 
negative and positive attributes. Also, a similar algorithm can be easily 
developed to find the set of all frequent itemsets with at most $k$ negative 
attributes: simply impose explicitly the bound $k$ on the corresponding loop in 
the algorithm. 

I. Fortes, J.L. Balcazar, and R. Morales 

Let us consider the symbol $f$ for the floor (that is, the number of negative 
attributes of the itemset, $0 \leq f \leq n$) and the symbol $\ell$ for the level 
(the number of attributes of the itemset, $0 \leq \ell \leq n$): we will write the 
sets $C_{f,\ell}$ and $L_{f,\ell}$ for candidates and frequent itemsets 
respectively. At the beginning we suppose that all $C_{f,\ell}$ and $L_{f,\ell}$ 
for $f \leq \ell \leq n$ are empty. 

With respect to this notation our algorithm traces the following path: 

(0, 1), (1, 1); (0, 2), (1, 2), (2, 2); (0, 3), (1, 3), (2, 3), (3, 3); etc. 
(recall the example). 
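
The order in which the rectangles are visited can be generated mechanically. The sketch below only reproduces the listed order; the actual algorithm stops early on floors and levels where no frequent itemsets remain. The name `search_path` and the optional `max_floor` parameter (standing for a bound on the number of negative attributes) are assumptions of this illustration.

```python
def search_path(n, max_floor=None):
    """Yield (floor, level) pairs level by level, lower floors first.

    n: number of attributes; max_floor: optional bound on the floor.
    """
    for level in range(1, n + 1):
        top = level if max_floor is None else min(level, max_floor)
        for floor in range(top + 1):
            yield (floor, level)
```

Passing `max_floor=1`, for example, restricts the walk to floors 0 and 1, which corresponds to discarding the floors that exceed a given bound.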

Now, we present the algorithm in a pseudocode style. For clarity, main loops 
are commented. After the algorithm we include additional comments about 
some instructions that improve the search of frequent itemsets. 

Algorithm bounded-neg-Apriori 

1. set current floor / := 0 
set current level £ := 1 

“This set is explained after the algorithm” 

J:=0 

2. “Initially, we find the frequent itemsets with isolated positive attributes” 

Lf^e := Vi G ///r(AW’®) > a} 

3. “This is the main loop to climb up floors” 
while f < £ and £ < n do 

while Ly £ yf 0 and f < £ and £ < n do 
k:=f + l 
■= 0 

“At this moment we can obtain the frequent itemsets of the upper” 
“floors at same level from the itemsets in the previous floor” 

“There are two cases according to <t” 

while k < £ do 
if k ^ J then 

Lk,e '■= 0 

if a < 0.5 then (1) 

Ck,i ■■= {AP’'*/ AP’®' G Lk- 14 -i, m G / - (p U s'), s = s' U {m}, 

Vi G p,AP-{A.- g Vj G s, AP>^-{A g 

else 

Ck,i ■■= {AP’V AP’®' G Lfc_iy_i, m G / - (p U s'), s = s' U {m}, 

Vi G p,AP-{A,- g Vj G s, AP>®-{A g Lk-i,e-i, 

Vj G s, /r(AP^^i^’®“^-^^) < a} 

fi 

Lk,e := {AP-® G Ck,e/3j G s, /r(AP>®-{A) > + /^(ylPub}.«-{A)} 

if Lk i = 0 then J := J U {k} fl 

fi 

if f = 1 and k = 1 and Lip yf 0 then I := {i/A®’^A g Lip} fi (2) 
set current floor fc := fc -I- 1 
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od (while k) 

“Selected a floor we look for the frequent itemsets in next level” 

“into this floor” 
set current level ^ := ^ + 1 
J \= J \J {k + 1/ k € J,k < n\ 
if / = 0 then 

:= {AP’VVi G P, AP-W’0 G 
L/., := {AP-0GC/,^//r(AP’0) >a} 
else 

Cf,t := {AP’VVt Gp,AP-{*>>^ G 

VjGs, GL/_i,^_i} 

L := { G /r (AP>^) > a} (3) 

fi 

od (while t) 

“If the maximum level into a floor is reached then we must go over” 

“to the next floor at this maximum level” 
set current floor /:=/ + ! 

Cf,i := {AP’VVf G p, G G s, e L/- 1 /- 1 } 

:= {AP’" G G s,/r(AP’"-{^>) >a + /r(AP'^{j>’"-{^>)} 

od (while f)

4. output $\bigcup_{k < \ell < n} L_{k,\ell}$



The algorithm refines the search of frequent itemsets by means of the set J. 
In each level, J indicates the floors where no frequent itemsets will exist. 

In the sentence labeled (3), the generation of candidates and the computation of
their frequencies must take into account whether $\sigma$ is below or above 0.5,
as in the instruction labeled (1).

Note that, by the sentence labeled (2), the only negative attributes that could 
appear in the candidate itemsets are the elements of Tip. So, we use this set, as 
soon as it is computed, to refine the index set I used later along the computation. 

With respect to the complexity of the algorithm, two aspects are considered
from a theoretical point of view: candidate generation and itemset frequency
computation. In candidate generation the worst case is reached when the
threshold $\sigma$ is less than or equal to 0.5. In this case, two itemsets, one
with a particular attribute positive and the other with the same attribute
negative, can be frequent simultaneously. If $\sigma > 0.5$ then, by remark 2 in
proposition 2, the generation is refined. Independently of the value of
$\sigma$, the sets I and J refine the candidate generation, so the
computational requirements can be reduced.

In the itemset frequency computation, only the frequencies of itemsets with
positive attributes are computed directly from the database. The frequencies of
the other candidate itemsets, with any number of negative attributes, are
obtained by using proposition 1. Therefore, the number of passes through the
database is the same as in Apriori, i.e., n+1, where n is the size of the
largest frequent itemset.
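
The following is a minimal Python sketch of how a frequency with negated attributes can be derived from all-positive counts only. Proposition 1 is not reproduced in this part of the text, so the sketch assumes it is the usual identity fr(A^{p,s}) = fr(A^{p,s-{j}}) - fr(A^{p ∪ {j},s-{j}}) for any j in s, which is the relation the thresholds in the algorithm suggest. The function names and the toy transactions are illustrative assumptions.

```python
def positive_frequency(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(items <= t for t in transactions) / len(transactions)


def frequency(transactions, positive, negative, cache=None):
    """Frequency of an itemset whose `positive` items are present and
    whose `negative` items are absent.

    Only all-positive itemsets are counted against the data. Negations are
    removed one at a time with the assumed identity
        fr(p, s) = fr(p, s - {j}) - fr(p | {j}, s - {j}).
    """
    positive, negative = frozenset(positive), frozenset(negative)
    if cache is None:
        cache = {}
    key = (positive, negative)
    if key not in cache:
        if not negative:
            cache[key] = positive_frequency(transactions, positive)
        else:
            j = min(negative)  # any element of the negative set works
            rest = negative - {j}
            cache[key] = (frequency(transactions, positive, rest, cache)
                          - frequency(transactions, positive | {j}, rest, cache))
    return cache[key]


# Toy database (hypothetical): four transactions over the items a, b, c.
transactions = [frozenset(t) for t in ({"a", "b"}, {"a"}, {"a", "c"}, {"b", "c"})]

# "a present, b absent", once by direct counting and once by the identity.
direct = sum(("a" in t) and ("b" not in t) for t in transactions) / len(transactions)
derived = frequency(transactions, {"a"}, {"b"})
```

In this sketch the database is touched only by `positive_frequency`; every itemset with negations is resolved from cached positive counts, which is the property the pass-count argument above relies on.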
4 Conclusions and Future Work

In cases where the absence of some items from a transaction is relevant but one 
wants to avoid the generation of many rules relating these absences, it can be 
useful to allow for a maximum of k such absences from the frequent sets; even if 
no good guess exists for k, it may be useful to organize the search in such a way 
that the itemsets with m items show up in the order mandated by how many of 
them are positive: first all positive, then m — 1 positive and one negative, and 
so on. Our algorithm allows one to do it and takes advantage of a number of 
facts, corresponding to relationships between the itemset frequencies, to avoid 
the counting of some candidates. 

Of course, it makes sense to try to combine this strategy with other
ideas that have been used together with Apriori, like random sampling to
evaluate the frequencies, or instead of Apriori, like alternative algorithms
such as DIC [4] or Ready-and-Go [3]. Experimental developments, as well as more
detailed analyses and a careful formalization of the setting, can lead to
improved results, and we continue to work along these two lines.
Abstract. The design of algorithms that explore multiple representation
languages and explore different search spaces has an intuitive
appeal. In the context of classification problems, algorithms that
generate multivariate trees are able to explore multiple representation 
languages by using decision tests based on a combination of attributes. 
The same applies to model trees algorithms, in regression domains, 
but using linear models at leaf nodes. In this paper we study where 
to use combinations of attributes in regression and classification tree 
learning. We present an algorithm for multivariate tree learning that 
combines a univariate decision tree with a linear function by means 
of constructive induction. This algorithm is able to use decision nodes 
with multivariate tests, and leaf nodes that make predictions using 
linear functions. Multivariate decision nodes are built when growing 
the tree, while functional leaves are built when pruning the tree. 
The algorithm has been implemented both for classification problems 
and regression problems. The experimental evaluation shows that our 
algorithm has clear advantages with respect to the generalization ability 
when compared against its components, two simplified versions, and
competes well against the state-of-the-art in multivariate regression and 
classification trees. 

Keywords: Decision Trees, Multiple Models, Supervised Machine 
Learning. 



1 Introduction 

The generalization ability of a learning algorithm depends on the appropriateness 
of its representation language to express a generalization of the examples for the 
given task. Different learning algorithms employ different representations, search 
heuristics, evaluation functions, and search spaces. It is now commonly accepted 
that each algorithm has its own selective superiority [3]; each is best for some 
but not all tasks. The design of algorithms that explore multiple representation 
languages and explore different search spaces has an intuitive appeal. This paper 
presents one such algorithm. 

In the context of supervised learning problems it is useful to distinguish
between classification problems and regression problems. In the former the target
variable takes values in a finite and pre-defined set of unordered values, and the
usual goal is to minimize a 0-1 loss function. In the latter the target variable
is ordered and takes values in a subset of $\mathbb{R}$. The usual goal is to
minimize a squared error loss function. Mainly due to the differences in the type
of the target variable, successful techniques for one type of problem are not
directly applicable to the other.

The supervised learning problem is to find an approximation to an unknown
function given a set of labelled examples. To solve this problem, several methods
have been presented in the literature. Two of the most representative methods
are the General Linear Model and decision trees. The two methods explore different
hypothesis spaces and use different search strategies. In the former the goal is to
minimize the sum of squared deviations of the observed values for the dependent
variable from those predicted by the model. It is based on the algebraic theory of
invariants and has an analytical solution. The description language of the model
takes the form of a polynomial that, in its simplest form, is a linear combination
of the attributes: $w_0 + \sum_i w_i x_i$. This is the basic idea behind linear
regression and discriminant functions [8]. The latter uses a divide-and-conquer
strategy. The goal is to decompose a complex problem into simpler problems and
recursively apply the same strategy to the sub-problems. Solutions of the
sub-problems are combined in the form of a tree. Its hypothesis space is the set
of all possible hyper-rectangular regions. The power of this approach comes from
the ability to split the space of the attributes into subspaces, whereby each
subspace is fitted with a different function. This is the basic idea behind
well-known tree-based algorithms [2,13].

In the case of classification problems, a class of algorithms that explore
multiple representation languages are the so-called multivariate trees
[2,20,12,6,11]. In this sort of algorithm, decision nodes can contain tests based
on a combination of attributes. The language bias of univariate decision trees
(axis-parallel splits) is relaxed, allowing decision surfaces that are oblique
with respect to the axes of the instance space. As in the case of classification
problems, in regression problems some authors have studied the use of regression
trees that explore multiple representation languages, here called model trees
[2,13,15,21,18]. But while in classification problems multivariate decisions
appear in internal nodes, in regression problems multivariate decisions appear in
leaf nodes. The problem that we study in this paper is where to use decisions
based on combinations of attributes. Should we restrict combinations of
attributes to decision nodes? Should we restrict combinations of attributes to
leaf nodes? Could we use combinations of attributes both at decision nodes and
leaf nodes?

The algorithm that we present here is an extension of multivariate trees. It
is applicable to regression and classification domains, allowing combinations of
attributes both at decision nodes and leaves. In the next section of the paper
we describe our proposal for functional trees. In Section 3 we discuss the
different variants of multivariate models using an illustrative example on
regression domains. In Section 4 we present related work both in the
classification and regression settings. In Section 5 we evaluate our algorithm
on a set of benchmark regression and classification problems. The last section
concludes the paper.

2 The Algorithm for Constructing Functional Trees 

The standard algorithm to build univariate trees consists of two phases. In the 
first phase a large tree is constructed. In the second phase this tree is pruned 
back. The algorithm to grow the tree follows the standard divide-and-conquer 
approach. The most relevant aspects are: the splitting rule, the termination 
criterion, and the leaf assignment criterion. With respect to the last criterion, 
the usual rule consists of assignment of a constant to a leaf node. Considering 
only the examples that fall at this node, the constant is usually the majority class 
in classification problems or the mean of the y values in the regression setting. 
With respect to the splitting rule, each attribute value defines a possible partition 
of the dataset. We distinguish between nominal attributes and continuous ones. 
In the former the number of partitions is equal to the number of values of the 
attribute, in the latter a binary partition is obtained. To estimate the merit 
of the partition obtained by a given attribute we use the gain ratio heuristic 
for classification problems and the decrease in variance criterion for regression 
problems. In any case, the attribute that maximizes the criterion is chosen as 
test attribute at this node. 
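
As an illustration of the splitting rule for a continuous attribute in the regression setting, here is a minimal Python sketch of the decrease-in-variance criterion for a binary partition. The text does not spell out the exact formula, so the sketch assumes the population variance of the target, a size-weighted average over the two partitions, and candidate thresholds at midpoints between consecutive distinct values. All names are illustrative.

```python
from statistics import pvariance


def variance_decrease(values, targets, threshold):
    """Decrease in target variance obtained by the split `value <= threshold`."""
    left = [y for x, y in zip(values, targets) if x <= threshold]
    right = [y for x, y in zip(values, targets) if x > threshold]
    if not left or not right:
        return 0.0  # a degenerate split brings no improvement
    n = len(targets)
    weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
    return pvariance(targets) - weighted


def best_threshold(values, targets):
    """Return (threshold, gain) for the best binary split of one attribute.

    Assumes the attribute takes at least two distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(((t, variance_decrease(values, targets, t)) for t in candidates),
               key=lambda pair: pair[1])


# Toy data (hypothetical): the target jumps between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, gain = best_threshold(x, y)
```

Running the same search over every attribute and keeping the attribute with the largest gain corresponds to choosing the test attribute at a node.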

The pruning phase consists of traversing the tree in a depth-first fashion.
At each non-leaf node two measures should be estimated: an estimate of the
error of the subtree below this node, computed as a weighted sum of
the estimated error for each leaf of the subtree, and the estimated error of the
non-leaf node if it were pruned to a leaf. If the latter is lower than the former,
the entire subtree is replaced by a leaf.

All of these aspects have several important variants; see for example [2,
14]. Nevertheless, all decision nodes contain conditions based on the values of
one attribute, and leaf nodes predict a constant.



2.1 Functional Trees 

In this section we present the general algorithm to construct a functional tree. 
Given a set of examples and an attribute constructor, the main algorithm used 
to build a functional tree is presented in Figure 1. This algorithm is similar to 
many others, except in the constructive step (steps 2 and 3). Here a function is 
built and mapped to new attributes. There are some aspects of this algorithm 
that should be made explicit. In step 2, a model is built using the Constructor 
function. This is done using only the examples that fall at this node. Later, in 
step 3, the model is mapped to new attributes. Actually, the constructor function 
should be a classifier or a regressor depending on the type of the problem. In 
the case of regression problems the constructor function is mapped to one new
attribute, the y value predicted by the constructor. In the case of classification
problems the number of new attributes is equal to the number of classes. Each 
Function Tree(DataSet, Constructor)

1. If Stop_Criterion(DataSet)

   - Return a Leaf Node with a constant value.

2. Construct a model $\Phi$ using Constructor

3. For each example $x \in$ DataSet

   - Compute $\hat{y} = \Phi(x)$

   - Extend x with a new attribute $\hat{y}$.

4. Select the attribute, from both the original and all newly constructed
   attributes, that maximizes some merit-function

5. For each partition i of the DataSet using the selected attribute

   - Tree_i = Tree(DataSet_i, Constructor)

6. Return a Tree, as a decision node based on the selected attribute, containing
   the $\Phi$ model, and descendants Tree_i.

End Function



Fig. 1. Building a Functional Tree 



new attribute is the probability that the example belongs to one class,¹ given
by the constructed model. The merit of each new attribute is evaluated using
the merit-function of the univariate tree, and in competition with the original
attributes (step 4). The model built by our algorithm has two types of decision
nodes: those based on a test of one of the original attributes, and those based on
the values of the constructor function. When using Generalized Linear Models
(GLM) [16] as attribute constructor, each new attribute is a linear combination
of the original attributes. Decision nodes based on constructed attributes define
a multivariate decision surface.
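
The growing procedure of Figure 1 can be sketched in Python for the regression case as follows. This is only an illustration under stated assumptions: the merit function is the decrease in variance over binary splits, the stop criterion is a minimum node size or zero target variance, and `fit_sum_model` is a toy, hypothetical constructor (least squares on the sum of the attributes) standing in for a full linear regression. None of these names come from the paper.

```python
from statistics import mean, pvariance


def split_gain(column, targets, threshold):
    left = [y for v, y in zip(column, targets) if v <= threshold]
    right = [y for v, y in zip(column, targets) if v > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(targets)
    return pvariance(targets) - weighted


def best_split(rows, targets):
    """Return (gain, attribute index, threshold) of the best binary split, or None."""
    best = None
    for a in range(len(rows[0])):
        column = [r[a] for r in rows]
        distinct = sorted(set(column))
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2
            gain = split_gain(column, targets, t)
            if best is None or gain > best[0]:
                best = (gain, a, t)
    return best


def fit_sum_model(rows, targets):
    """Toy constructor (hypothetical): least squares fit of y on the sum of the attributes."""
    s = [sum(r) for r in rows]
    ms, my = mean(s), mean(targets)
    sxx = sum((v - ms) ** 2 for v in s)
    slope = sum((v - ms) * (y - my) for v, y in zip(s, targets)) / sxx if sxx else 0.0
    return lambda row: my + slope * (sum(row) - ms)


def grow(rows, targets, constructor, min_size=4):
    # Step 1: stop criterion, return a leaf predicting a constant.
    if len(rows) < min_size or pvariance(targets) == 0:
        return {"leaf": mean(targets)}
    # Step 2: build the model from the examples that fall at this node.
    model = constructor(rows, targets)
    # Step 3: extend every example with the model's prediction.
    extended = [r + [model(r)] for r in rows]
    # Step 4: original and constructed attributes compete on the same merit.
    split = best_split(extended, targets)
    if split is None or split[0] <= 0:
        return {"leaf": mean(targets)}
    _, attr, threshold = split
    # Step 5: partition the data; constructed attributes stay available below.
    parts = {True: ([], []), False: ([], [])}
    for r, y in zip(extended, targets):
        parts[r[attr] <= threshold][0].append(r)
        parts[r[attr] <= threshold][1].append(y)
    # Step 6: decision node holding the model and the subtrees.
    return {"attr": attr, "threshold": threshold, "model": model,
            "left": grow(*parts[True], constructor, min_size),
            "right": grow(*parts[False], constructor, min_size)}


# Toy data (hypothetical): the target depends on x1 + x2, an oblique boundary.
rows = [[float(a), float(b)] for a in range(4) for b in range(4)]
targets = [10.0 if a + b >= 4 else 0.0 for a, b in rows]
tree = grow(rows, targets, fit_sum_model)
```

On this toy data no single original attribute separates the two target values, so the root test is placed on the constructed attribute (index 2), which is the multivariate decision surface described above.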

Once a tree has been constructed, it is pruned back. The general algorithm to
prune the tree is presented in Figure 2. To estimate the error at each leaf (step 1)
we distinguish between classification and regression problems. In the former we
assume a binomial distribution using a process similar to the pessimistic error of
C4.5. In the latter we assume a distribution of the variance of the cases in it
using a process similar to the pruning described in [18]. A similar procedure
is used to estimate the constructor error (step 3). The pruning algorithm
produces two different types of leaves: Ordinary Leaves that predict a constant,
and Constructor Leaves that predict the value of the Constructor function learned
(in the growing phase) at this node.
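
The pruning procedure of Figure 2 can be sketched in Python as below. The sketch assumes that the error estimates are already available at each node and are expressed on a common scale (for instance as expected error counts), so that the backed-up error is a plain sum over the descendants. The dictionary field names and the tie-breaking order (ordinary leaf first, then constructor leaf) are illustrative assumptions, not part of the original algorithm description.

```python
def prune(node):
    """Prune `node` in place and return its estimated error.

    Each node is a dict with the hypothetical fields "leaf_error",
    "constructor_error" and "children" (an empty list for leaves).
    """
    leaf_error = node["leaf_error"]                                   # step 1
    if not node["children"]:                                          # step 2
        node["kind"] = "leaf"
        return leaf_error
    constructor_error = node["constructor_error"]                     # step 3
    backed_up_error = sum(prune(child) for child in node["children"])  # step 4
    best = min(leaf_error, constructor_error, backed_up_error)        # step 5
    if best == leaf_error:
        node["kind"], node["children"] = "leaf", []
    elif best == constructor_error:
        node["kind"], node["children"] = "constructor_leaf", []
    else:
        node["kind"] = "subtree"
    return best                                                       # step 6


def leaf(error):
    return {"leaf_error": error, "constructor_error": error, "children": []}


# Toy tree (hypothetical error estimates).
inner = {"leaf_error": 5, "constructor_error": 2, "children": [leaf(2), leaf(2)]}
root = {"leaf_error": 12, "constructor_error": 9, "children": [inner, leaf(3)]}
root_error = prune(root)
```

In the toy tree the inner node becomes a constructor leaf (error 2 against a backed-up error of 4), while the root keeps its subtree because the backed-up error 2 + 3 = 5 is lower than both alternatives.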

By simplifying our algorithm we obtain different conceptual models. Two 
interesting lesions are described in the following sub-sections. 



Bottom-Up Approach. We call Bottom-Up Approach the variant of functional trees
in which the functional models are used exclusively at leaves. This is the strategy

¹ At different nodes the system considers a different number of classes, depending
on the class distribution of the examples that fall at this node.




Functional Trees 






Function Prune(Tree)

1. Estimate Leaf_Error as the error at this node.

2. If Tree is a leaf Return Leaf_Error.

3. Estimate Constructor_Error as the error of $\Phi$.

4. For each descendant i

   - Backed_Up_Error += Prune(Tree_i)

5. If argmin(Leaf_Error, Constructor_Error, Backed_Up_Error)

   - Is Leaf_Error

     • Tree = Leaf

     • Tree_Error = Leaf_Error

   - Is Constructor_Error

     • Tree = Constructor Leaf

     • Tree_Error = Constructor_Error

   - Is Backed_Up_Error

     • Tree_Error = Backed_Up_Error

6. Return Tree_Error

End Function



Fig. 2. Pruning a Functional Tree 



used, for example, in M5 [15,21] and in the NBtree system [10]. In our tree
algorithm this is done by restricting the selection of the test attribute (step 4
in the growing algorithm) to the original attributes. Nevertheless we still build,
at each node, the constructor function. The model built by the constructor
function is used later in the pruning phase. In this way, all decision nodes are
based on the original attributes. Leaf nodes could contain a constructor model. A
leaf node contains a constructor model if and only if, in the pruning algorithm,
the estimated error of the constructor model is lower than both the
Backed_Up_Error and the estimated error the node would have if it were replaced
by a leaf.



Top-Down Approach. We call Top-Down Approach the variant of functional trees
in which the multivariate models are used exclusively at decision nodes (internal
nodes). In our algorithm, these models are obtained by restricting the pruning
algorithm to choose only between the Backed_Up_Error and the Leaf_Error.
In this case all leaves predict a constant value. This is the strategy used, for
example, in systems like LMDT [20], OC1 [12], and Ltree [6].

Functional trees extend and generalize multivariate trees. Our algorithm can
be seen as a hybrid model that performs a tight combination of a univariate
tree and a GLM function. The components of the hybrid algorithm use different
representation languages and search strategies. While the tree uses a
divide-and-conquer method, linear regression performs a global minimization.
While the former performs feature selection, the latter uses all (or almost all)
the attributes to build a model. From the point of view of the bias-variance
decomposition of the error [1], a decision tree is known to have low bias but high
variance, while GLM functions are known to have low variance but high bias.
This is the desirable behaviour for components of hybrid models.



3 An Illustrative Example 

In this section we use the well-known regression dataset Housing to illustrate 
the different variants of functional models. The attribute constructor used is the 
linear regression function. Figure 3(a) presents a univariate tree for the Housing 




Fig. 3. (a) The Univariate Regression Tree and (b) the Top-Down Regression Tree for
the Housing problem.



dataset. Decision nodes only contain tests based on the original attributes. Leaf 
nodes predict the average of y values taken from the examples that fall at the 
leaf. 

In a top-down multivariate tree (Figure 3(b)) decision nodes could contain 
(not necessarily) tests based on a linear combination of the original attributes. 
The tree contains a mixture of learned attributes, denoted as LR Node, and 
original attributes, e.g. AGE, DIS. Any of the linear-regression attributes can 
be used both at the node where they have been created and at deeper nodes. 
For example, the LR Node 19 has been created at the second level of the tree. It 
is used as test attribute at this node, and also (due to the constructive ability) 
as test attribute at the third level of the tree. Leaf nodes predict the average 
of y values of the examples that fall at this leaf. In a bottom-up multivariate 
tree (Figure 4(a)) decision nodes only contain tests based on the original at- 
tributes. Leaf nodes could predict (not necessarily) values obtained by using a 
linear-regression function built from the examples that fall at this node. This is 
Fig. 4. (a) The Bottom-Up Multivariate Regression Tree and (b) the Multivariate
Regression Tree for the Housing problem.



the kind of multivariate regression tree that usually appears in the literature.
For example, the systems M5 [15,21] and RT [18] generate this kind of model.
Figure 4(b) presents the full multivariate regression tree using both top-down and
bottom-up multivariate approaches. In this case, decision nodes could contain
(not necessarily) tests based on a linear combination of the original attributes,
and leaf nodes could predict (not necessarily) values obtained by using a
linear-regression function built from the examples that fall at this node.

Figure 5 illustrates the functional models in the case of a classification
problem. We have used the UCI dataset Learning Qualitative Structure Activity
Relationships - QSARs pyrimidines to illustrate the different variants of tree
models. This is a complex two-class problem defined by 54 continuous attributes.
The attribute constructor used is the LinearBayes [5] classifier. In a bottom-up
functional tree (Figure 5(a)) decision nodes only contain tests based on the
original attributes. Leaf nodes could predict (not necessarily) values obtained
by using a LinearBayes function built from the examples that fall at this node.
Figure 5(b) presents the functional tree using both top-down and bottom-up
multivariate approaches. In this case, decision nodes could contain (not
necessarily) tests based on a linear combination of the original attributes, and
leaf nodes could predict (not necessarily) values obtained by using a LinearBayes
function built from the examples that fall at this node.

4 Related Work 

Breiman et al. [2] present the first extensive and in-depth study of the problem
of constructing decision and regression trees. But, while in the case of decision
trees they consider internal nodes with a test based on a linear combination of

Fig. 5. (a) The Bottom-Up Functional Tree and (b) the Functional Tree for the QSARs
problem.



attributes, in the case of regression trees internal nodes are always based on a 
single attribute. 

In the context of classification problems, several algorithms have been
presented that could use, at each decision node, tests based on a linear
combination of the attributes [2,12,20,6]. The most comprehensive study on
multivariate trees has been presented by Brodley and Utgoff in [4]. Brodley and
Utgoff discuss several methods for constructing multivariate decision trees:
representing a multivariate test, including symbolic and numeric features,
learning the coefficients of a multivariate test, selecting the features to
include in a test, and pruning of multivariate decision trees. Brodley only
considers multivariate tests at inner nodes in a tree. In this context few works
consider functional tree leaves. One of the earliest works is the Perceptron tree
algorithm [19], where leaf nodes may implement a general linear discriminant
function. Also Kohavi [10] has presented the naive Bayes tree that uses
functional leaves. NBtree is a hybrid algorithm that generates a regular
univariate decision tree, but the leaves contain a naive Bayes classifier built
from the examples that fall at this node. The approach retains the
interpretability of naive Bayes and decision trees, while resulting in classifiers
that frequently outperform both constituents, especially in large datasets. Also,
Gama [7] has presented Cascade Generalization, a method to combine
classification algorithms by means of constructive induction. The work presented
here closely follows the Cascade method, but extends it to regression domains
and allows models with functional leaves.

In regression domains, Quinlan [13] has presented the system M5. It builds
multivariate trees using linear models at the leaves. In the pruning phase, a
linear model is built for each leaf. Recently, Witten and Frank [21] have
extended M5. A linear model is built at each node of the initial regression tree.
All the models along a particular path from the root to a leaf node are then
combined into one linear model in a smoothing step. Also Karalic [9] has studied
the influence of using linear regression in the leaves of a regression tree. As in
the work of Quinlan, Karalic shows that it leads to smaller models with increased
performance. Torgo [17] has presented an experimental study about functional
models for regression tree leaves. Later, the same author [18] has presented the
system RT. When used with linear models at the leaves, RT builds and prunes a
regular univariate tree. Then at each leaf a linear model is built using the
examples that fall at this leaf.



5 Experimental Evaluation 

It is commonly accepted that multivariate regression trees should be competitive
against univariate models. In this section we evaluate the proposed algorithm,
its simplified variants, and its components on a set of classification and
regression benchmark problems. In regression problems the constructor is a
standard linear regression function. In classification problems the constructor
is the LinearBayes classifier [5]. For comparative purposes we also evaluate the
system M5².
The main goal in this experimental evaluation is to study the influence, in terms
of performance, of the position of the linear models inside a regression and a
classification tree. We evaluate three situations:

— Trees that could use linear combinations at each internal node. 

— Trees that could use linear combinations at each leaf. 

— Trees that could use linear combinations both at each internal and leaf nodes. 

All evaluated models are based on the same tree growing and pruning algorithm.
That is, they use exactly the same splitting criteria, stopping criteria, and
pruning mechanism. Moreover, they share many minor heuristics that individually
are too small to mention, but collectively can make a difference. As a result,
the differences in the evaluation statistics are due to the differences in the
conceptual model.

In this work we estimate the performance of a learned model using 10-fold
cross validation. To minimize the influence of the variability of the training set,
we repeat this process ten times, each time using a different permutation of the
dataset. The final estimate is the mean of the performance statistic obtained
in each run of the cross validation. For regression problems the performance is
measured in terms of the mean squared error statistic. For classification
problems the performance is measured in terms of the error rate statistic. To
apply pairwise comparisons we guarantee that, in all runs, all algorithms learn
and test on the same partitions of the data. We compare the performance of the

² We have used M5 from version 3.1.8 of the Weka environment. We have used several
regression systems. The most competitive was M5.
functional tree (FT) against its components: the univariate tree (UT) and the
constructor function (linear regression (LR) in regression problems, and
LinearBayes (LB) in classification problems). The functional tree is also compared
against the two simplified versions: Bottom-Up (FT-B) and Top-Down (FT-T).
For each dataset, comparisons between algorithms are done using the Wilcoxon
signed-rank paired test. The null hypothesis is that the difference between
performance statistics has median value zero. We consider that a difference in
performance has statistical significance if the p value of the Wilcoxon test is
less than 0.01.
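
The evaluation protocol can be sketched in Python as follows. The sketch makes several assumptions that the text does not settle: partitions are shared between algorithms by fixing a random seed, the paired samples compared by the test are the per-run estimates, and the p value is computed exactly by enumerating sign patterns rather than by a normal approximation. Function names are illustrative.

```python
import random
from itertools import product


def repeated_folds(n_examples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx); the same seed gives the same partitions."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_examples))
        rng.shuffle(order)  # a different permutation of the dataset per repeat
        for fold in range(n_folds):
            test = order[fold::n_folds]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield repeat, fold, train, test


def signed_rank_p_value(a, b):
    """Exact two-sided Wilcoxon signed-rank p value for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    if not diffs:
        return 1.0
    ordered = sorted(abs(d) for d in diffs)
    # Tied absolute differences share the average of their rank positions.
    rank = {v: (ordered.index(v) + 1 + len(ordered) - ordered[::-1].index(v)) / 2
            for v in set(ordered)}
    ranks = [rank[abs(d)] for d in diffs]
    total = sum(ranks)
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    extreme = abs(observed - total / 2)
    # Under the null hypothesis every rank is positive or negative with probability 1/2.
    hits = sum(abs(sum(r for r, s in zip(ranks, signs) if s) - total / 2) >= extreme
               for signs in product((0, 1), repeat=len(ranks)))
    return hits / 2 ** len(ranks)


# Hypothetical per-run errors of two algorithms over ten runs.
errors_a = [float(i) for i in range(2, 12)]
errors_b = [1.0] * 10
p_value = signed_rank_p_value(errors_a, errors_b)
significant = p_value < 0.01
```

Every algorithm is expected to iterate over `repeated_folds` with the same seed, so that each pair of error estimates comes from identical training and test partitions. The exhaustive enumeration is only practical for a small number of pairs, such as the ten runs assumed here.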



5.1 Results in Regression Domains 

We have chosen 20 datasets from the Repository of Regression problems at
LIACC³. The choice of datasets was restricted by the criterion that almost all the
attributes are ordered, with few missing values⁴. The number of examples varies
from 43 to 40768. The number of attributes varies from 5 to 48. The results in
terms of MSE and standard deviation are presented in Table 1. The first two
columns refer to the results of the components of the hybrid algorithm. The
following three columns refer to the simplified versions of our algorithm and the
full model. The last column refers to the M5 system. For each dataset, the
algorithms are compared against the full multivariate tree using the Wilcoxon
signed-rank test. A - (+) sign indicates that for this dataset the performance of
the algorithm was worse (better) than the full model with a p value less than 0.01.

Table 1 presents a comparative summary of the results. The first line presents
the geometric mean of the MSE statistic across all datasets. The second line
shows the average rank of all models, computed for each dataset by assigning
rank 1 to the best algorithm, 2 to the second best and so on. The third line shows
the average ratio of MSE. This is computed for each dataset as the ratio between
the MSE of one algorithm and the MSE of M5. The fourth line shows the number
of significant differences using the signed-rank test taking the multivariate
tree FT as reference. We use the Wilcoxon Matched-Pairs Signed-Ranks Test
to compare the error rate of pairs of algorithms across datasets⁵. The last line
shows the p values associated with this test for the MSE results on all datasets
and taking FT as reference. It is interesting to note that the full model (FT)
significantly improves over both components (LR and UT) in 14 datasets out
of 20. All the multivariate trees have a similar performance. Using the
significance test as the criterion, FT is the best-performing algorithm. It is
interesting to note that the bottom-up version is the most competitive algorithm.
The ratio of significant wins/losses between the bottom-up and top-down versions
is 4/3.
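
The first three summary statistics can be sketched in Python as below. The sketch assumes that ties share the average rank, which the text does not specify, and it uses only three rows of Table 1 (Auto-mpg, Machine and Housing) as input, so its output is not expected to reproduce the summary lines computed over all 20 datasets. Function names are illustrative.

```python
from math import exp, log


def geometric_mean(values):
    return exp(sum(log(v) for v in values) / len(values))


def ranks(row):
    """Rank 1 for the lowest error; tied values share the average rank (assumption)."""
    ordered = sorted(row)
    return [(ordered.index(v) + 1 + len(ordered) - ordered[::-1].index(v)) / 2
            for v in row]


def summarize(table, names, reference):
    """Per-algorithm geometric mean, average rank and average ratio to `reference`."""
    ref = names.index(reference)
    summary = {}
    for j, name in enumerate(names):
        column = [row[j] for row in table]
        summary[name] = {
            "geometric_mean": geometric_mean(column),
            "average_rank": sum(ranks(row)[j] for row in table) / len(table),
            "average_ratio": sum(row[j] / row[ref] for row in table) / len(table),
        }
    return summary


names = ["LR", "UT", "FT-T", "FT-B", "FT", "M5"]
# MSE values of three datasets taken from Table 1.
table = [
    [11.470, 19.409, 8.921, 9.560, 9.131, 7.958],   # Auto-mpg
    [5952.0, 6036.0, 3473.0, 3300.0, 3032.0, 3557.0],  # Machine
    [23.840, 19.591, 16.251, 13.359, 16.538, 12.467],  # Housing
]
summary = summarize(table, names, reference="M5")
```

Because the ratio is taken against M5, the average ratio of M5 itself is 1 by construction, as in the third summary line of Table 1.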

³ http://www.ncc.up.pt/~ltorgo/Datasets

⁴ In regression problems, the actual implementation ignores missing values at
learning time. At application time, if the value of the test attribute is unknown,
all descendant branches produce a prediction. The final prediction is a weighted
average of the predictions.

⁵ Each pair of data points consists of the estimated MSE on one dataset for the
two learning algorithms being compared.
Table 1. Summary of Results in Regression Problems (MSE).

                                              Functional Trees
Data           LR              UT              FT-T (Top)      FT-B (Bottom)   FT              M5
Abalone      - 4.908±0.0     - 5.728±0.1       4.616±0.0     - 4.759±0.0       4.602±0.0       4.553±0.5
Auto-mpg     - 11.470±0.1    - 19.409±1.2    + 8.921±0.4       9.560±0.8       9.131±0.5       7.958±3.5
Cart         - 5.684±0.0     + 0.995±0.0     - 1.016±0.0     + 0.993±0.0       1.012±0.0       0.994±0.0
Computer     - 99.907±0.2    - 10.955±0.6    - 6.426±0.6     - 6.507±0.5       6.284±0.6     - 8.081±2.7
Cpu          - 3734±1717     - 4111±1657     - 1760±389      - 1197±161        1070±137        1092±1315
Diabetes       0.399±0.0     - 0.535±0.0     - 0.500±0.0       0.400±0.0       0.399±0.0       0.446±0.3
Elevators    - 1.02e-5±0.0   - 1.4e-5±0.0    - 0.86e-5±0.0     0.5e-5±0.0      0.5e-5±0.0      0.52e-5±0.0
Fried        - 6.924±0.0     - 3.474±0.0     - 1.862±0.0     - 2.348±0.0       1.850±0.0     - 1.938±0.1
H. Quake       0.036±0.0       0.036±0.0       0.036±0.0       0.036±0.0       0.036±0.0       0.036±0.0
House(16H)   - 2.06e9±6.1e5  - 1.69e9±3.3e7  + 1.20e9±2.2e7    1.19e9±3.0e7    1.23e9±2.2e7    1.27e9±1.2e8
House(8L)    - 1.73e9±8.2e5  - 1.19e9±1.2e7  + 1.01e9±1.3e7    1.02e9±9.2e6    1.02e9±1.3e7    9.97e8±7.1e7
House(Cal)   - 4.81e9±2.0e6  - 3.69e9±3.5e7  - 3.09e9±2.7e7  + 2.78e9±2.8e7    3.05e9±3.1e7    3.07e9±2.8e8
Housing      - 23.840±0.2    - 19.591±1.7      16.251±1.1    + 13.359±1.7      16.538±1.3      12.467±7.5
Kinematics   - 0.041±0.0     - 0.035±0.0     - 0.027±0.0     - 0.026±0.0       0.023±0.0     - 0.025±0.0
Machine      - 5952±2053     - 6036±1752       3473±673        3300±757        3032±759        3557±4271
Pole         - 930.08±0.3    + 48.55±1.2       79.48±2.6     + 35.16±0.7       79.31±2.4     + 42.0±5.8
Puma32       - 7.2e-4±0.0    - 1.1e-4±0.0    + 0.71e-4±0.0     0.82e-4±0.0     0.82e-4±0.0     0.67e-4±0.0
Puma8        - 19.925±0.0    - 13.307±0.2    + 11.047±0.1      11.145±0.1      11.241±0.1    + 10.299±0.5
Pyrimidines  - 0.018±0.0       0.014±0.0     + 0.010±0.0       0.013±0.0       0.013±0.0       0.012±0.0
Triazines    - 0.025±0.0     + 0.019±0.0     - 0.018±0.0       0.023±0.0       0.023±0.0       0.017±0.0

Summary of MSE Results

                     LR      UT      FT-T    FT-B    FT      M5
Geometric Mean       39.2    23.59   17.68   16.47   16.90   16.2
Average Rank         5.4     4.9     3.15    2.9     2.5     2.3
Average Ratio        4.0     1.57    1.13    1.03    1.07    1
Wins / Losses        1/19    4/16    8/12    6/11    -       11/9
Signi. Wins/Losses   0/18    3/15    6/9     4/5     -       2/3
Wilcoxon Test        0.0     0.02    0.21    0.1     -       0.23




Nevertheless, there is a computational cost associated with the observed increase
in performance. To run all the experiments reported here, FT requires almost
1.8 times as much time as the univariate regression tree.



5.2 Results in Classification Problems 



We have chosen 30 datasets from the UCI repository. For comparative purposes
we also evaluate M5' [21]. M5' decomposes an n-class classification problem into
n-1 binary regression problems⁶. The results in terms of error rate and standard
deviation are presented in Table 2. The first two columns refer to the results of
the components of our system, the LinearBayes and the univariate tree. The
next two columns refer to the lesioned versions of the algorithm, the
Bottom-Up (FT-B) and Top-Down (FT-T). The fifth column refers to the full proposed



^ We have used other multivariate trees. The most competitive was M5'.
Table 2. Summary of Error Rate Results

Dataset        LinBayes       Univ. Tree     Functional Trees                             M5'
               LB             UT             Bottom         Top            FT
Adult        - 17.012±0.5     14.178±0.5   - 14.307±0.4     13.800±0.4     13.830±0.4   - 15.182±0.6
Australian     13.498±0.3     14.750±1.0   - 14.343±0.4     13.928±0.6     13.638±0.6     14.643±5.2
Balance      - 13.355±0.3   - 22.467±1.1   - 10.445±0.6     7.313±0.9      7.313±0.9    - 13.894±3.2
Banding        23.681±1.0     23.512±1.8     23.512±1.8     23.762±2.2     23.762±2.2     22.619±5.3
Breast(W)    + 2.862±0.1    - 5.123±0.2    - 4.337±0.1      3.346±0.4      3.346±0.4      5.137±3.1
Cleveland      16.134±0.4   - 20.995±1.4   + 15.952±0.5     17.369±0.9     16.675±0.8     17.926±8.0
Credit       + 14.228±0.1     14.608±0.5     14.784±0.5     15.103±0.4     15.220±0.6     14.913±3.7
Diabetes     + 22.709±0.2   - 25.348±1.0     23.998±1.0   - 25.206±0.9     23.658±1.0     25.002±4.8
German         24.520±0.2     28.240±0.7   + 23.630±0.5     24.870±0.5     24.330±0.7     26.300±3.1
Glass        - 36.647±0.8     32.150±2.3     32.150±2.3     32.509±3.3     32.509±3.3     29.479±10.4
Heart          17.704±0.2   - 23.074±1.7     17.037±0.6     17.333±1.4     17.185±0.8     16.667±8.9
Hepatitis    + 15.481±0.7     17.135±1.3     17.135±1.3     17.135±1.3     17.135±1.3     19.919±8.5
Ionosphere     13.379±0.8     10.025±0.9     10.624±0.9     11.175±1.4     11.175±1.4     9.704±4.1
Iris           2.000±0.0    - 4.333±0.8      2.067±0.2    - 3.733±0.8      2.067±0.2      5.333±5.3
Letter       - 29.821±1.3     11.880±0.6     12.005±0.6     11.799±1.1     11.799±1.1   + 9.440±0.5
Monks-1      - 25.009±0.0     10.536±1.7     11.150±1.9     8.752±1.9      8.729±1.9      10.054±8.9
Monks-2      - 34.186±0.6   - 32.865±0.0   - 33.907±0.4     9.004±1.6      9.074±1.6      27.664±20.9
Monks-3      - 4.163±0.0    + 1.572±0.4      3.511±0.9      2.884±0.4      2.998±0.4      1.364±2.4
Mushroom     - 3.109±0.0    + 0.000±0.0    + 0.062±0.0      0.112±0.0      0.112±0.0      0.025±0.1
Optdigits    - 4.687±0.1    - 9.476±0.3    - 4.732±0.1      3.295±0.1      3.300±0.1    - 5.429±1.4
Pendigits    - 12.425±0.0   - 3.559±0.1    - 3.099±0.1      2.890±0.1      2.890±0.1      2.419±0.4
Pyrimidines  - 9.846±0.1    + 5.733±0.2      6.115±0.2      6.158±0.2      6.159±0.2      6.175±0.9
Satimage     - 16.011±0.1   - 12.894±0.2   - 12.894±0.2     11.776±0.3     11.776±0.3     12.402±3.2
Segment      - 8.407±0.1      3.381±0.2      3.381±0.2      3.190±0.2      3.190±0.2      2.468±0.8
Shuttle      - 5.629±0.3      0.028±0.0      0.028±0.0      0.036±0.0      0.036±0.0      0.067±0.0
Sonar          24.955±1.2     27.654±3.5     27.654±3.5     27.654±3.5     27.654±3.5     22.721±9.0
Vehicle        22.163±0.1   - 27.334±1.2   + 18.282±0.5     21.090±1.1     21.031±1.1     20.900±4.6
Votes        - 9.739±0.2      3.773±0.5      3.773±0.5      3.795±0.5      3.795±0.5      4.172±4.0
Waveform     + 14.939±0.2   - 24.036±0.8   + 15.216±0.2   - 16.142±0.3     15.863±0.4   - 17.241±1.4
Wine           1.133±0.5    - 6.609±1.3      1.404±0.3      1.459±0.3      1.404±0.3      3.830±3.6





                         LB       UT       FT-B     FT-T     FT       M5'
Average Mean             15.31    14.58    12.72    11.89    11.72    12.77
Geometric Mean           11.63    9.03     7.03     6.80     6.63     7.24
Average Rank             4.0      4.1      3.1      3.3      3.0      3.4
Average Ratio            7.545    1.41     1.12     1.032    1        1.23
Wins/Losses              11/19    9/19     13/13    6/10     -        12/18
Significant Wins/Losses  5/15     3/12     5/8      0/3      -        1/4
Wilcoxon Test            0.00     0.00     0.8      0.07     -
model (FT). The last column refers to the results of M5'. For each dataset, the
algorithms are compared against the full functional tree using the Wilcoxon
signed-rank test. A - (+) sign indicates that, for this dataset, the performance
of the algorithm was worse (better) than that of the full model with a p value
less than 0.01.
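
As an illustration of the per-dataset comparison, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test on paired error rates. The function name, the normal approximation to the null distribution, and the omission of tie and continuity corrections are assumptions made here for brevity; the paper does not describe the implementation it used, and the approximation is rough for small samples.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    a, b: paired measurements, e.g. error rates of two algorithms on the
    same runs. Returns (w_plus, w_minus, p_value). Zero differences are
    dropped; tied absolute differences receive their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Mean and standard deviation of the rank sum under the null hypothesis.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, w_minus, min(p, 1.0)

# Hypothetical error rates of one algorithm and of the reference on ten runs.
algo = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
reference = [10] * 10
print(wilcoxon_signed_rank(algo, reference))
```

Under this sketch, a "-" mark would correspond to a p value below 0.01 with the algorithm's errors predominantly larger than the reference's.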

Table 2 also presents a comparative summary of the results. The first two lines
present the arithmetic and the geometric mean of the error rate across all
datasets. The third line shows the average rank of all models, computed for
each dataset by assigning rank 1 to the best algorithm, 2 to the second best, and
so on. The fourth line shows the average ratio of error rates. This is computed
for each dataset as the ratio between the error rate of one algorithm and the
error rate of the full functional tree FT. The fifth and sixth lines show the
number of wins and losses and the number of significant differences using the
signed-rank test, taking the multivariate tree FT as reference. We use the
Wilcoxon Matched-Pairs Signed-Ranks Test to compare the error rates of pairs
of algorithms across datasets. The last line shows the p values associated with
this test for the results on all datasets, taking FT as reference. All the
evaluation statistics show that FT is a competitive algorithm.
The most competitive simplified version is, again, the bottom-up version. The
ratio of significant wins/losses between the bottom-up and top-down versions is
10/6. It is interesting to note that the full model (FT) significantly improves
over both components (LB and UT) in 6 datasets.
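
The summary statistics described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and the dictionary layout are hypothetical, ties in the ranking are assumed to share the average rank, and the geometric mean assumes strictly positive error rates (a zero error rate would need special handling).

```python
import math

def summarize(errors, reference):
    """errors: {algorithm: [error rate per dataset]}; reference: baseline key."""
    names = list(errors)
    n = len(errors[reference])
    # Rank the algorithms within each dataset (1 = lowest error).
    ranks = {name: [] for name in names}
    for d in range(n):
        column = sorted(errors[name][d] for name in names)
        for name in names:
            value = errors[name][d]
            first = column.index(value) + 1
            last = first + column.count(value) - 1
            ranks[name].append((first + last) / 2)  # average rank over ties
    ref = errors[reference]
    summary = {}
    for name in names:
        e = errors[name]
        summary[name] = {
            "mean": sum(e) / n,
            "geo_mean": math.exp(sum(math.log(x) for x in e) / n),
            "avg_rank": sum(ranks[name]) / n,
            # Ratio to the reference, averaged over datasets.
            "avg_ratio": sum(x / r for x, r in zip(e, ref)) / n,
            "wins": sum(x < r for x, r in zip(e, ref)),
            "losses": sum(x > r for x, r in zip(e, ref)),
        }
    return summary

# Toy error rates on two datasets, with "B" taken as the reference.
print(summarize({"A": [2.0, 8.0], "B": [4.0, 4.0]}, reference="B"))
```

By construction the reference algorithm has an average ratio of 1 and no wins or losses against itself, which is why those cells are 1 and "-" in the FT column.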

5.3 Discussion 

The experimental evaluation points out some interesting observations: 

- For both types of problems we obtain similar rankings of performance
among the different versions of the algorithm.

- All multivariate tree versions perform similarly. On these datasets,
there is no clear winner among the different versions of functional trees.

- Any functional tree outperforms its constituents on a large set of problems.

In our study the results are consistent across both types of problems. Our
experimental study suggests that the full model, that is, a multivariate model using
linear functions both at decision nodes and at leaves, is the best-performing algorithm.
Another dimension of analysis is the size of the model. Here we consider the
number of leaves, which measures the number of different regions into which the
instance space is partitioned. On these datasets, the average number of leaves for
the univariate tree is 70. Every multivariate tree generates smaller models: the
average number of leaves is 50 for the full model, 56 for the bottom-up approach,
and 52 for the top-down approach. Nevertheless, there is a computational cost
associated with the observed increase in performance. To run all the experiments
reported here, FT requires almost 1.7 times as much time as the univariate tree.

6 Conclusions 

In this paper we have presented Functional Trees, a new formalism for constructing
multivariate trees for regression and classification problems. The proposed
algorithm is able to use functional decision nodes and functional leaf nodes.
Functional decision nodes are built when growing the tree, while functional leaves are
built when pruning the tree. A contribution of this work is that it provides a
single framework for classification and regression multivariate trees. Functional
trees can be seen as a generalization of multivariate trees for decision problems
and model trees for regression problems, allowing functional decisions at both
inner and leaf nodes. We have experimentally observed that the unified framework
is competitive with the state of the art in model trees.
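
To make the distinction between functional decision nodes and functional leaves concrete, here is a minimal sketch of how a tree that has already been built might compute a regression prediction. The class and field names are hypothetical, and the sketch covers prediction only; it does not reproduce the growing and pruning procedure described in the paper.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Leaf:
    # A functional leaf predicts with a linear model; all-zero weights
    # reduce it to an ordinary constant leaf.
    weights: List[float]
    bias: float

    def predict(self, x):
        return self.bias + sum(w * v for w, v in zip(self.weights, x))

@dataclass
class Node:
    # A functional decision node tests a linear combination of attributes;
    # a single non-zero weight reduces it to a univariate test.
    weights: List[float]
    threshold: float
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]

    def predict(self, x):
        score = sum(w * v for w, v in zip(self.weights, x))
        child = self.left if score <= self.threshold else self.right
        return child.predict(x)

# Illustrative tree: split on x0 + x1 <= 1, linear leaf on the left,
# constant leaf on the right.
tree = Node([1.0, 1.0], 1.0, Leaf([2.0, 0.0], 1.0), Leaf([0.0, 0.0], 5.0))
print(tree.predict([0.2, 0.3]), tree.predict([1.0, 1.0]))
```

In these terms, the simplified variants compared in the experiments restrict where linear functions may appear (decision nodes only or leaves only), while the full model allows them in both places.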

Another contribution of this work is the study of where to use decisions
based on a combination of attributes, in both regression and classification. In the
experimental evaluation on a set of benchmark problems we have compared the
performance of a functional tree against its components, two simplified versions,
and the state of the art in multivariate trees. The results are consistent across both
types of problems. Our experimental study suggests that the full model, that is,
a multivariate model using linear functions both at decision nodes and at leaves, is
the best-performing algorithm. Although most of the work on multivariate
classification trees follows the top-down approach, the bottom-up approach seems
to be competitive. A similar observation applies to regression problems. These
observations point to directions for future research on this topic.
Abstract. We briefly summarize some of the lessons learned in a workshop on
cognitive studies of science and technology. Our purpose was to assemble a diverse
group of practitioners to discuss the latest research, identify the stumbling blocks to 
advancement in this field, and brainstorm about directions for the future. Two 
questions became central themes. First, how can we combine artificial studies in- 
volving ‘spherical horses’ with fine-grained case studies of actual practice? Results 
obtained in the laboratory may have low applicability to real world situations. Sec- 
ond, how can we deal with academics’ attachments to their theoretical frameworks? 
Academics often like to develop unique ‘toothbrushes’ and are reluctant to use
anyone else’s. The workshop illustrated that toothbrushes can be shared and that
spherical horses and fine-grained case studies can complement one another. Theo- 
ries need to deal rigorously with the distributed character of scientific and techno- 
logical problem solving. We hope this workshop will suggest directions more so- 
phisticated theories might take. 



1 Introduction 

At the turn of the 21st century, the most valuable commodity in society is knowledge, 
particularly new knowledge that may give a culture, a company, or a laboratory an 
advantage [1-3]. Therefore, it is vital for the science and technology studies commu- 
nity to study the thinking processes that lead to discovery, new knowledge and inven- 
tion. Knowledge about these processes can enhance the probability of new and useful 
technologies, clarify the process by which new ideas are turned into marketable reali- 
ties, make it possible for us to turn students into ethical inventors and entrepreneurs, 
and facilitate the development of business strategies and social policies based on a 
genuine understanding of the creative process. 

2 A Workshop on Scientific and Technological Thinking 

In order to get access to cutting-edge research on techno-scientific thinking, Michael 
Gorman obtained funding from the National Science Foundation, the Strategic Insti- 
tute of the Boston Consulting Group and the National Collegiate Inventors and Inno- 
vators Alliance to hold a workshop at the University of Virginia from March 24-27, 
2001. With assistance from Alexandra Kincannon, Ryan Tweney and others, he
assembled a diverse group of practitioners, focusing on those in the middle of their
careers and also on junior faculty and graduate students who represent the future. There
were 29 participants, including 18 senior or mid-career researchers, and 11 junior 
faculty and graduate students. Representatives from the NSF, the Strategic Institute 
of the BCG and the NCIIA also attended. Their role was to keep participants focused 
on lessons learned, even as the participants worked to assess the state of the art and 
push beyond it, establishing new directions for research on scientific and
technological thinking.

In the rest of this brief paper, Gorman and Kincannon, two of the organizers of 
the workshop, and Matthew Mehalik, one of the participants, will highlight results 
from this workshop, citing the work of participants where appropriate and adding
interpretive material of their own.¹

Two questions dominated in the workshop, each illustrated by a metaphor. David 
Gooding, a philosopher from the University of Bath who has done fine-grained stud- 
ies of the thinking processes of Michael Faraday, told a joke that set up one theme. In 
the joke, a multimillionaire offered a prize for predicting the outcome of a horse race 
to a stockbreeder, a geneticist, and a physicist. The stockbreeder said there were too 
many variables, the geneticist could not make a prediction about any particular horse, 
but the physicist claimed the prize, saying he could make the prediction to many 
decimal places — provided it were a perfectly spherical horse moving through a vac- 
uum. This metaphor led to a question: How can we combine artificial studies in- 
volving ‘spherical horses’ and fine-grained case studies of actual practice? Results 
obtained under rigorous laboratory conditions may have what psychologists call low 
ecological validity, or low applicability to real world situations [4]. Highly abstract 
computational models often ignore the way in which real-world knowledge is embed- 
ded in social contexts and embodied in hands-on practices [5]. 

The second metaphor came from Christian Schunn, then at George Mason Uni- 
versity and now at the University of Pittsburgh, who noted that taxonomies and 
frameworks are like toothbrushes — no one wants to use anyone else’s. This metaphor 
led to another question: How can we transcend academics’ attachments to their indi- 
vidual theoretical frameworks? Academic psychologists, historians, sociologists and 
philosophers like to develop and refine unique toothbrushes and are reluctant to use 
anyone else’s. Real-world practitioners are not as fussy; they are willing to assemble a 
‘bricolage’ of elements from various frameworks that academics might regard as in- 
commensurable. 



3 A Moratorium against Spherical Horses? 

Nancy Nersessian, a philosopher and cognitive scientist from the Georgia Institute of 
Technology, reminded participants that Bruno Latour declared a ten-year moratorium 
against cognitive studies of science in 1986. Latour was one of the key figures in 
promoting a new sociology of scientific knowledge. He and others were reacting 



¹ The views reflected here are those of the authors, and have not been endorsed by workshop
participants, the NSF, BCG or the NCIIA. All participants were taped, with their consent, 
and we have used these tapes in an effort to reconstruct highlights. Thanks to Pat Langley for 
his comments on a draft. 
against the idea that science was a purely rational enterprise, carried out in an abstract
cognitive space. 

Cognitive scientists like Herbert Simon contributed to this abstract cognizer view 
of science.² Simon was one of the founders of a movement Nersessian labeled “Good
Old Fashioned Artificial Intelligence” (GOFAI). Simon’s toothbrush, or framework, 
began with the assumption that there is nothing particularly unique about what a Kep- 
ler does — the same thinking processes are used on both ordinary and extraordinary 
problems [6]. Simon was a revolutionary in the Kuhnian sense; he played a major
role in creating artificial intelligence and linking it with a new science of thinking, 
called cognitive science. 

Peter Slezak used programs like BACON to turn the tables on Latour’s morato- 
rium: 

A decisive and sufficient refutation of the 'strong programme' in the sociology of 
scientific knowledge (SSK) would be the demonstration of a case in which sci- 
entific discovery is totally isolated from all social or cultural factors whatever. I 
want to discuss examples where precisely this circumstance prevails concerning 
the discovery of fundamental laws of the first importance in science. The work I 
will describe involves computer programs being developed in the burgeoning 
interdisciplinary field of cognitive science, and specifically within 'artificial in- 
telligence' (AI). The claim I wish to advance is that these programs constitute a 
'pure' or socially uncontaminated instance of inductive inference, and are capable 
of autonomously deriving classical scientific laws from the raw observational 
data [7, pp. 563-564]. 

Slezak argued that if programs like BACON [8, 9] can discover, then there is no 
need to invoke all these interests and negotiations the sociologists use to explain dis- 
covery. His claims sparked a vigorous debate in the November, 1989 issue of the 
journal Social Studies of Science. Latour and Slezak illustrate how academics can 
create almost incommensurable frameworks. If the Simon perspective is a tooth- 
brush, then Latour is denying that it even exists — and vice versa. 
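
The kind of law finding at issue in this debate can be illustrated with a deliberately simple sketch. It is a brute-force stand-in written for this summary, not BACON's actual method: it tries small integer exponents until some ratio of powers of two observed quantities is nearly constant. The function name, the tolerance, and the rounded planetary data are assumptions supplied here.

```python
def find_invariant(xs, ys, max_power=3, tolerance=0.01):
    """Return the first (a, b) for which x**a / y**b is nearly constant, else None."""
    for a in range(1, max_power + 1):
        for b in range(1, max_power + 1):
            values = [x ** a / y ** b for x, y in zip(xs, ys)]
            mean = sum(values) / len(values)
            # Accept when the relative spread of the ratio is below the tolerance.
            if (max(values) - min(values)) / mean < tolerance:
                return a, b
    return None

# Rounded textbook values for Mercury through Saturn (not taken from the paper):
# mean distance from the Sun in astronomical units, orbital period in years.
distance = [0.387, 0.723, 1.000, 1.524, 5.203, 9.537]
period = [0.241, 0.615, 1.000, 1.881, 11.862, 29.457]

print(find_invariant(distance, period))
```

On these data the search settles on the exponents 3 and 2, that is, distance cubed over period squared is nearly constant, which is Kepler's third law. Whether such a procedure counts as socially uncontaminated discovery is exactly what the participants in the debate disputed, since the choice of variables, data, and search space is made by the programmer.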

Nersessian reminded participants that, had Simon been at the workshop, he would 
have argued that his toothbrush does incorporate the social and cultural; it is just that 
all of this is represented symbolically in memory [10]. Therefore, cognition is about 
symbol processing. These symbols could be as easily instantiated in a computer as in 
a brain. 

In contrast, Greeno and others advocate a position whose roots might be traced to 
Gibson and Dewey: that knowledge emerges from the interaction between the indi- 
vidual and the situation [11]. Cognition is distributed in the environment as well as 
the brain, and is shared among individuals [12, 13]. Merlin Donald discusses the role
of culture in the evolution of cognition [14]. Nersessian in her own work explored 
how cultural factors can account for differences between the problem-solving ap- 
proaches of scientists like Maxwell and Ampere [15]. 

In the symbol-processing view, discovery and invention are merely aspects of a 
general problem-solving system that can best be represented at the ‘spherical horse’ 
level. In the situated and distributed view, discovery and invention are practices that 



² Simon intended to be a participant in our workshop, but died shortly before it, a great
tragedy and a great loss. During the planning stages, he referred to this as a workshop of ‘right
thinkers’. For tributes to him, see 

http://www.people.virginia.edu/~apk5t/STweb/mainST.html . 
need to be studied in their social context. This situated cognition perspective comes
much closer to that of sociologists and anthropologists of science [16], but advocates
like Norman and Hutchins still talk about the importance of representations like 
mental models. 

Jim Davies, from the Georgia Institute of Technology, applied Nersessian’s cog- 
nitive-historical approach to a case study of the use of visual analogy in scientific 
discovery. Davies analyzed the process of conceptual change in Maxwell’s work on 
electromagnetism and applied to it a model of visual analogical problem solving 
called Galatea. He found that visual analogy played an important role in the devel- 
opment of Maxwell’s theories and demonstrated that the cognitive-historical approach 
is useful for understanding general cognitive processes. 

Ryan Tweney, a co-organizer of the workshop, described his own in vivo case 
study of Michael Faraday’s work on the interaction of light and gold films [33]. 
Tweney is in the process of replicating these experiments to unpack the tacit knowl- 
edge that is embodied in the cognitive artifacts created by Faraday. He hopes to do a 
kind of material protocol analysis that goes beyond the verbal material that is in
Faraday’s diary. One end result might be a digital version of Faraday’s diary that includes
images and perhaps even QuickTime movies of replications of his experiments. This 
kind of study potentially bridges the gap between situated and symbolic studies of 
discovery. 

4 A Common Set of Toothbrushes? 

David Klahr, a cognitive psychologist at Carnegie Mellon, has shown a preference for 
spherical horses, conducting experiments on scientific thinking. However, his ex- 
periments have used sophisticated, complex tasks. For example, he and two of his 
students (also workshop participants) Kevin Dunbar and Jeff Shrager asked partici- 
pants in a series of experiments to program a device called the Big Trak, and studied 
the processes they used to solve this problem. The Big Trak was a battery-powered 
vehicle that could be programmed, via a keypad, to move according to instructions. 
One of the keys was labeled RPT. Participants had to discover its function. 

Following in Herb Simon’s footsteps, Klahr, with Dunbar and Schunn, character- 
ized subjects’ performance as a search in two problem spaces, one occupied by possi- 
ble experiments, the other by hypotheses [17]. They found that one group of subjects 
(Theorists) preferred to work in the hypothesis space, proposing about half as many 
experiments as the second group (Experimenters). Almost all of the former's experi- 
ments were guided by a hypothesis, whereas the latter's were often simply explora- 
tory. 

Based on this and other work, Klahr proposed a possible general framework, or 
shareable toothbrush, for classifying the different kinds of cognitive studies. This 
general framework is based on multiple problem spaces, and whether the study was a 
general one, using an abstract task like the Big Trak, or domain-specific, like Nerses- 
sian’s studies of Maxwell [18]. 

Dunbar, currently at McGill University and moving to Dartmouth in the fall, 
added to this general framework the idea of classifying experiments based on whether 
they were in vitro (controlled laboratory experiments) or in vivo (case studies). Com- 
putational simulations can be based on either in vivo or in vitro studies. A system for 
classifying studies of scientific discovery might begin with a 2x2 matrix. Big Trak is 
an example of an in vitro technique; the work on Maxwell described by Nersessian, 
on Faraday by Tweney, and on nuclear fission by Andersen, are examples of in vivo
work. The three in vivo research programs did not explicitly distinguish between hy- 
pothesis and experiment spaces, but the practitioners studied generated both hypothe- 
ses and experiments. 

The rest of this paper will feature highlights from the workshop that will force us 
to expand and transform this classification scheme (see Table 1). Dunbar’s work has
iterated between in vivo and in vitro studies. The value of in vitro work is the way in 
which it allows for control and isolation of factors, such as the way in which the
possibility of error encourages experimental participants to adopt a confirmatory heuristic
[19].

Dunbar thinks it is important to compare such findings with what scientists actu- 
ally do. He has conducted a series of in vivo studies of molecular biology laborato- 
ries [20, 21]. Group studies have the heuristic value of forcing people to explain their 
reasoning. Regarding error, the molecular biologists had evolved special controls to 
check each step in a complex procedure in order to eliminate error. Dunbar ran an in 
vitro study in which he found that undergraduate molecular biology students would 
also employ this kind of control on a task that simulated the kind of reasoning used in 
molecular biology [22]. Dunbar’s work shows the importance of iterating between in 
vitro and in vivo studies. 

Schunn and his colleagues were interested in how scientists deal with unexpected 
results, or anomalies. In one study, he videotaped two astronomers interacting over a 
new set of data concerning the formation of ring galaxies. Schunn found that these 
researchers noticed anomalies as much as expected results, but paid more attention to 
the anomalies. The researchers developed hypotheses about the anomalies and elabo- 
rated on them visually, whereas they used theory to elaborate on expected results. 
When the two astronomers discussed the anomalies, they used terms like ‘the funky 
thing’ and ‘the dipsy-doodle’, staying at a perceptual rather than a theoretical level. 
Schunn’s astronomers were working in neither the hypothesis nor the experiment space;
instead, they were working in a space of possible visualizations dependent on their 
domain-specific experience. 

Hanne Andersen, from the University of Copenhagen, described the use of a fam- 
ily resemblance view of taxonomic concepts for understanding the dynamics of con- 
ceptual change. She noted that the family resemblance account has been criticized for 
not being able to distinguish sufficiently between different concepts, the problem of 
wide-open texture. This limitation could be resolved by including dissimilarity as 
well as similarity between concepts and by focusing on taxonomies instead of indi- 
vidual concepts. Anomalies can be viewed as violations of taxonomic principles that 
then lead to conceptual change. Andersen applied this approach to the discovery of 
nuclear fission, finding that early models of disintegration and atomic structure were 
revised in light of anomalous experimental results of this taxonomic kind.

Shrager, affiliated with the Department of Plant Biology, Carnegie Institution of 
Washington, and the Institute for the Study of Learning and Expertise, did a reflective 
study of his own socialization into phytoplankton molecular biology. In the begin- 
ning, he had to be told about every step, even when there were explicit instructions; 
he needed an extensive apprenticeship. As his knowledge grew, he noted that it was 
“somewhere between his head and his hands.” As his skill developed, he was able to 
take some of his attention off the immediate task at hand and understand the purpose 
of the procedures he was using. On at least one occasion, this came together in the 
“blink of an eye.” The cognitive framework he found most useful was his own
toothbrush: view application [23]. To his surprise, Shrager found that, “What passes for
theory in molecular biology is the same thing that passes for a manual in car mechan- 
ics.” He found less of a need to keep reflective notes in his diary as he became more 
proficient, though he continued to record the details of experiments, where particular 
materials were stored and all the other procedural details that are vital to a molecular 
biologist. He commented that, “if you lose your lab notebook, you’re hosed.” 

Gooding indicated that more abstract computational models of the spherical horse 
variety have not worked well for him. For him, “the beauty is in the dirt.” In collabo- 
ration with Tom Addis, a computer scientist, he evolved a detailed, computational 
scheme for representing Faraday’s experiments, hypotheses and construals [24]. 
Gooding thought that communication ought to be added to the matrix proposed by 
Klahr and Dunbar (See Table 1). 

Paul Thagard, from the University of Waterloo, has been gathering ideas from 
leaders in the field about what it takes to be a successful scientist. According to Herb 
Simon, one should not work on what everyone else is working on and one needs to 
have a secret weapon, in his case, computational modeling. As part of a case study, 
Thagard interviewed a microbiologist, Patrick Lee, who accidentally discovered that a 
common virus has potential as a treatment for cancer. The discovery was the result of 
a “stupid” experiment in viral replication done by one of Lee’s graduate students. 
The “stupid” experiment produced an anomalous result that eventually led to the gen- 
eration of a new hypothesis about the virus’ ability to kill cancer cells. This chain of 
events is an example of abductive hypothesis formation, in which hypotheses are gen- 
erated and evaluated in order to explain data. Once a hypothesis was generated that 
fit the data, researchers used deduction to arrive at the hypothesis that the virus could 
kill cancer cells. Thagard raises the questions of how one decides what experiments 
to do and how one determines what is a good experiment. These questions are a criti- 
cal part of the cognitive processes involved in discovery. Thagard is also looking at 
the role of emotions in scientific inquiry, in judgments about potential experiments, in 
reactions to unexpected results, and in reactions to successful experiments (Thagard’s
model of emotions and science: http://cogsci.uwaterloo.ca). Thagard suggested
adding a space of questions to the Klahr framework.

Robert Rosenwein, a sociologist at Lehigh, presented an in vitro simulation of sci- 
ence (SCISIM) that comes close to an in vivo environment [25]. Students in a class 
like Gorman’s Scientific and Technological Thinking (http://128.143.168.25/classes 
/200R/tcc200rf00.html) take on a variety of social roles in science. Some work in 
competing labs, others run funding agencies, still others run a journal and a newslet- 
ter. The students in the labs try to get funding for their experiments, and then publish 
the results. They do not do the kinds of fine-grained experimental processes done by 
participants in Big Trak; instead, they choose the variables they want to combine in an 
experiment, select a level of precision, and are given a result. Experiments cost ‘sim- 
bucks’ and salaries have to be paid, so there is continual pressure to fund the lab. 
There is a group of independent scientists as well, who have to decide which line of 
research to pursue. SCISIM adds another column to the matrix, for simulation of pur- 
suit decisions. Pursuit decisions concern which research program to seek funding for 
(See Table 1). 

Such decisions are usually made within a network of enterprises. Marin Simina, a 
cognitive scientist at Tulane, described a computational simulation of Alexander
Graham Bell’s network of enterprises. Howard Gruber coined the term ‘network of
enterprises’ to describe the way in which Darwin pursued multiple projects that eventually
played a synergistic role in his theory of evolution [26]. Similarly, Alexander Graham 
Bell had two major enterprises in 1873: making speech visible to the deaf, and send- 
ing multiple messages down a single wire. These enterprises were synthesized in his 
patent for a speaking telegraph, which focused on the type of current that would have 
to be used to transmit and receive speech [27, 28]. 

Simina created a program called ALEC, which simulated the discovery Bell made 
on June 2, 1875. At that time, Bell’s primary goal was to achieve fame and fortune by
solving the problem of multiple telegraphy. Bell had suspended the goal of
transmitting speech because his mental model for a transmitter contained an indefinite number
of metal reeds, and it was not clear how it could be built. On June 2, 1875, a single tuned
reed transmitted multiple tones with sufficient volume to serve as a transmitter for the 
human voice. Bell was not seeking this result; he wanted the reed to transmit only a 
single tone. But this serendipitous result allowed him to activate his suspended goal 
and instruct Watson to build the first telephone [29]. ALEC was able to simulate the 
process of suspending the goal and how Bell was primed to reactivate it by a result. 

5 Collaboration and Invention 

Gary Bradshaw, a cognitive scientist at Mississippi State and a collaborator with Herb 
Simon, talked about “stepping off Herb’s shoulders into his shadow.” In a study of 
the Wright Brothers, he adapted Klahr’s framework to invention, creating three 
spaces: function, hypothesis and design [30]. One of the major reasons the Wrights 
succeeded where others failed was that the brothers decomposed the problem into 
separate functions — like vertical lift, horizontal stability, and turning. Other inventors 
worked primarily in a design space, adding features like additional wings without the 
careful functional analysis done by the Wrights. This suggests that function and de- 
sign spaces ought to be added for inventors (see Table 1). 

To see how well his framework of invention work-spaces held up, Bradshaw tried 
another case — the rocket boys from West Virginia, immortalized in a book by Homer 
Hickam [31], and in the film October Sky [32]. Their problem of rocket construction 
could be decomposed into multiple spaces, but a complete factorial of all the possible 
variations would come close to two million cells, so they could not follow the strategy 
called Vary One Thing at a Time (VOTAT) — they did not have the resources. Al- 
though the elements of the rocket construction were not completely separable, they 
tested some variables in isolation, such as fuel mixtures in bottles. They also did care- 
ful post-launch inspection, and used theory to reduce the problem space; for example, 
they used calculus to derive their nozzle shape. They built knowledge as they went 
along, taking good notes. Team members also took different roles — one was more of 
a scientist, another more of an engineer and project manager. 

Tweney argued from his own experience that the rocket system was much less 
decomposable than suggested by Bradshaw’s analysis and that the West Virginia 
group seemed to hit upon some serendipitous decompositions. Tweney’s rocket group
was stronger in chemistry, so they used theory to create the fuel, and copied the noz- 
zle design. Both were post-Sputnik groups active during the late 1950’s, although 
Tweney insists that his was a less serious “rocket boy” group than the one studied by 
Hickam.

Mehalik, a Systems Engineer at the University of Virginia, developed a framework
which combined Hutchins’ analysis of distributed cognition ‘in the wild’ [12], with 
three states or stages in actor networks. 

1. A top-down state in which one actor or group of actors controls the research 
program and tells others what to do. 

2. A trading zone state in which no group of actors has a comprehensive view, 
but all are connected by a boundary object that each sees differently. Peter 
Galison uses particle detectors as an example of this sort of boundary object 
[34].

3. A shared representation state in which all actors have a common perspective 
on what needs to be accomplished, even if there is still some division of la- 
bor based on skills, aptitude and expertise. 

Mehalik applied this framework to the invention of an environmentally sustainable 
furniture fabric by a global group. This network began with a shared mental model 
based on an analogy to nature, then struggled to settle into a stable trading zone in 
which participants would trade economic benefits and prestige. The resulting fabric 
has won almost a dozen major environmental awards and is seen as a leading example 
of innovative environmental design. 

Klahr suggested that Mehalik’s research might add another dimension to his overall
framework: capturing work in groups and teams. It might be possible to take each
of the major actants studied by Mehalik, look at what spaces they worked in, then 
show links between them and their different activities. Tweney raised an important 
question about distributed cognition — could intra-individual cognition be modeled in 
a way similar to inter-individual cognition by including the three-state framework? 

Michael Hertz, from the University of Virginia, developed a tool for determining 
causal attribution, and applied it to Monsanto’s initially unsuccessful introduction of 
GMOs into Europe. The tool did not allow Hertz to identify a primary cause, but it
did reduce the complexity of the decision space for students studying the Monsanto 
case and trying to determine who or what was at fault. Shrager suggested implement- 
ing this tool in an Echo network that would incorporate interaction with the decision- 
makers themselves. Bernie Carlson raised the question of when it is useful to quan- 
tify certain decision situations, again relating to the theme of the balance between 
using a tool to help reduce complexity in a decision situation while still maintaining 
contextual validity. Ryan Tweney raised the issue of using Hertz’s framework in a 
predictive sense — the dynamic complexity of the situation may be too difficult to 
make predictions; however, prediction is what a company such as Monsanto may be 
most interested in. Hertz responded by saying that the act of trying to identify causes 
has heuristic value, especially if a tool helps Monsanto distinguish between the rela- 
tive role of factors it can influence and factors that are largely beyond its control. De- 
cision aids and simulations simplify complex situations; decision-makers need to re- 
member that these simplifications may not accurately reflect all important aspects of 
the underlying situation, including complex, dynamic interactions among variables. 

Thomas Hughes, a historian of technology, talked about his analysis of collective 
invention in large-scale systems like the development of the Atlas and Polaris missiles 
[35]. He extolled the virtues of systems management techniques and the benefits of 
isolating scientists from bureaucracy. Project management and oversight functions 
change with the size of the group and management becomes more explicitly needed 
with larger groups. Without sufficient oversight, large projects can be too diffuse and 
inefficient. Dunbar suggested that having this kind of systems management was one
reason why the privately funded Celera outperformed the publicly funded Human 
Genome Project. 

William (Chip) Levy, from the Department of Neurosurgery at the University of 
Virginia, described a neural network that models results of an implicit learning ex- 
periment. He uses the model as an illustration of how variability can be an adaptive 
property in biological terms. Complex systems, like brains and like neural network 
models, benefit from the random fluctuations of noise. Eliminating variability in these 
systems would sacrifice too much memory capacity. Variability exists both within 
and between individuals. 

Levy’s research highlights the role of tacit knowledge in discovery and invention. 
Sociologists of science and technology emphasize the tacit dimension [36, 37]. There 
is a growing cognitive literature on implicit knowledge in psychology [38, 39], but 
this literature does not connect directly to discovery and invention. Several confer- 
ence participants mentioned tacit knowledge. Robert Matthews, a cognitive psycholo- 
gist at Louisiana State and one of the leading researchers on implicit learning [40], 
predicted that Dunbar’s scientists would be unable to explain why they did what they 
did. Dunbar responded that the scientists’ after-the-fact stories about how they did 
what they did had nothing to do with their actual processes. Schunn noted Karmiloff- 
Smith’s three stages of learning, in which the second stage means you can do some- 
thing without being able to explain it, and the third stage involves reflection [41]. The 
way to become aware of one’s implicit knowledge is to watch oneself, which can in- 
terfere with performance. 

Maria Ippolito, from the University of Alaska, compared the creative process ex- 
hibited in the writings of Virginia Woolf to that used by scientists. Ippolito offered 
Woolf as an example of a scientific thinker in a more general sense and constructed a 
multi-dimensional database using Woolf’s writings. Through the examination of 
Woolf’s development as a writer, Ippolito investigated the psychological processes of 
creative problem solving, including heuristics, scripts and schemata, development of 
expertise, and search of unstructured problem spaces. 

Elke Kurz, from the University of Tübingen, commented on two studies in which
she observed the softening of often-perceived boundaries between cognitive-historical 
case study analysis and in-laboratory analyses. She examined how scientists and 
mathematicians used different representational systems, such as variant forms of Cal- 
culus, when problem solving. These differences can be traced to historical develop- 
ments in the different scientific fields. Such historical developments invite historical 
case analysis as a necessary part of the study of the conceptual resources these differ- 
ent scientists possessed. Kurz also replicated experiments involving perception of 
size constancy that had been done earlier by Brunswik. During the attempts at repli- 
cation, Kurz noted how Brunswik needed to constrain the participants’ agency into 
forms that Brunswik found tolerable in the context of his experiment. Kurz stated the 
construction of this context of acceptable agency is a process worth studying using 
historical case methods, again complementing the in-laboratory style of investigation. 
Finally, Kurz reported on the difficulties of attempting a replication of a previous
experiment because of the changes in many contextual events between the original
experiment and the replicated experiment. This situation again invites the crossing of
any perceived boundary between the case study and in-laboratory approaches. 

6 Lessons Learned

The workshop illustrates that toothbrushes can be shared. The example we used in 
this paper was the Simon/Klahr multiple spaces framework. Table 1 summarizes the 
potential spaces identified in the workshop. 



Table 1. Different search spaces identified by participants in the workshop. Asterisks denote 
computational simulations, a kind of ‘spherical horse’ that can be based on either in vivo or in 
vitro studies. Italics denote spaces that are unique to invention. 



Search Spaces 


In Vitro 


In Vivo 


Hypotheses 


Big Trak, SciSim 


Maxwell, Faraday


Experiments 


Big Trak, SciSim 


Maxwell, Faraday 


Pursuit 


SciSim 


ALEC* 


Communication 
Embodied knowledge 
Taxonomies 
Visualizations 
Questions 

Links in a social network 

Function 

Design 


SciSim 


Faraday 

Faraday, Shrager 
Nuclear fission 

Galatea*, Schunn’s astronomers 
Patrick Lee 
Hughes, Mehalik 
Wright brothers, rocket boys 
Wright brothers, rocket boys 



The problem with this framework is that each study seemed to suggest the need 
for yet another space. There is not always a clear line of demarcation between spaces. 
For example, SciSim incorporates in vivo cases, which means that it can exist in a 
kind of gray zone between in vitro and in vivo. Visualizations can be thought ex- 
periments, ways of seeing the data, and mental models of a device or even of a social 
network. Despite its shortcomings, this framework has heuristic value, both for or- 
ganizing research already done and for suggesting directions for future work. For 
example, only Bradshaw has worked with function and design spaces, and there is no 
in vitro work on invention.

Mehalik’s work demonstrated the need for mapping movements among spaces 
across individuals over time. What would happen if we added time-scale to the 
framework? Schunn suggested that visualizations happen most rapidly, with experiments
and hypotheses taking longer, and taxonomies even longer.^ Hughes and Mehalik
remind us that time-scale is partly dependent on the extent to which each of
these activities depends on network-building. 

This framework is also general enough to facilitate comparisons between discovery,
invention and artistic creation, as Ippolito noted. More comparisons of this sort
are needed. 



^ Personal communication. 

7 Future of Cognitive Studies of Science and Technology



Bruce Seely, a historian of technology on rotation at the NSF’s Science and Technol- 
ogy Studies program, felt that the workshop showed how cognitive studies of science 
and technology had grown in sophistication, highlighting the creators of new knowl- 
edge in ways that complemented studies of users by other STS disciplines. 

Tiha von Ghyczy, representing the Strategic Institute of the Boston Consulting 
Group, noted that managers are happy to use any toothbrush that will help them im- 
prove their business strategies, and they are also more concerned about practical re- 
sults than methodological foundations. Still, he felt that managers would find lessons 
from the workshop interesting. Strategies have a very short half-life; a successful 
strategy is quickly imitated by competitors. Therefore, original thinking is essential 
for business survival. 

Besides business strategy and science-technology studies, a cognitive approach to 
invention and discovery should also inform work in ‘mainstream’ cognitive science. 
Theories and frameworks need to be able to deal in a rigorous way with the shared 
and distributed character of scientific and technological problem solving, and also its 
tacit dimension. We hope this workshop will suggest the outlines more sophisticated 
theories and models might take. Ideally, anyone doing a computational model or deci- 
sion-aid for discovery would base it on one or more fine-grained case studies. Tweney 
and Dunbar have had particularly good success combining in vitro and in vivo ap- 
proaches. We hope this workshop will encourage more collaborations between those 
trained in spherical-horse approaches and those capable of going deeply into the de- 
tails of particular discoveries and inventions. 

Abstract. Generating press clippings for companies manually requires
a considerable amount of resources. We describe a system that moni- 
tors online newspapers and discussion boards automatically. The system 
extracts, classifies and analyzes messages and generates press clippings 
automatically, taking the specific needs of client companies into account. 
Key components of the system are a spider, an information extraction 
engine, a text classifier based on the Support Vector Machine that cate- 
gorizes messages by subject, and a second classifier that analyzes which 
emotional state the author of a newsgroup posting was likely to be in. 
By analyzing a large number of messages, the system can summarize the
main issues that are being reported on for given business sectors, and can 
summarize the emotional attitude of customers and shareholders towards 
companies. 



1 Introduction 

Monitoring newspaper or journal articles, or postings to discussion boards, is an
extremely laborious task when carried out manually. Press clipping agencies em- 
ploy thousands of personnel in order to satisfy their clients’ demand for timely 
and reliable delivery of publications that relate to their own company, to their 
competitors, or to the relevant markets. The internet presence of most publi- 
cations offers the possibility of automating this filtering and analyzing process. 
One challenge that arises is to analyze the content of a news story well enough to 
judge its relevance for a given client. A second difficulty is to provide appropriate 
overview and analyzing functionality that allows a user to keep track of the key 
content of a potentially huge amount of relevant publications. 

Software systems that spider the web in search of relevant information, and 
extract and process found information are usually referred to as information 
agents [16,2]. They are being used, for instance, to find interesting web sites 
or links [19,12], or to filter news group postings (e.g., [26]). One attribute of
information agents is how they determine the relevance of a document to a user. 

Content-based recommendation systems (e.g., [1]) judge the interestingness
of a document to the user based on the content of other documents that the user
has found interesting. By contrast, collaborative filtering approaches (e.g., [13])
draw conclusions based on which documents other users with similar preferences
have found interesting. In many applications, it is not reasonable to ask the 
user to elaborate his or her preferences explicitly. Therefore, information agents 
often try to learn a function that expresses user interest from user feedback; e.g., 
[26,18]. By contrast, a user who approaches a press clipping agency usually has 
specific, elaborated information needs. 

The problem of identifying predetermined relevant information in text or 
hypertext documents from some specific domain is usually referred to as information
extraction (IE) (e.g., [4,3]). In the news clipping context, several instances
of the information extraction problem occur. Firstly, press articles have
to be extracted from HTML pages where they are usually embedded between 
link collections, adverts, and other surrounding text. Secondly, named entities 
such as companies or products have to be identified and extracted and, thirdly, 
meta-information such as publication dates or publishers need to be found. 

While the first IE algorithms were hand-crafted sets of rules (e.g., [7]), algorithms
that learn extraction rules from hand-labeled documents (e.g., [8,14,6]) have
now become standard. Unfortunately, rule-based approaches sometimes fail to 
provide the necessary robustness against the inherent variability of document 
structure, which has led to the recent interest in the use of Hidden Markov 
Models (HMMs) [25,17,21,23] for this purpose. 

In order to identify whether the content of a document matches one of the categories
the user is interested in and to summarize the subjects of large amounts
of relevant documents, classifiers that are learned from hand-labeled documents
(e.g., [24,11]) provide a means of categorizing a document’s content that reaches
far beyond key word search. Furthermore, it can be interesting to determine the 
emotional state [9] of authors of postings about a company or product. 

In this paper, we discuss a press clipping information agent that downloads 
news stories from selected news sources, classifies the messages by subject and 
business sector, and recognizes company names. It then generates customized 
clippings that match the requirements of clients. We describe the general architecture
in Section 2, and discuss the machine learning algorithms involved in
Section 3. Section 4 concludes. 



2 Publication Monitoring System 

Figure 1 sketches the general architecture of the system. A user can configure 
the information service by providing a set of preferences. These include the 
names of all companies that he or she would like to monitor, as well as all 
business areas (e.g., biotechnology, computer hardware) of interest. The spider
cyclically downloads a set of newspapers, journals, and discussion boards. The
set of news sources is fixed in advance and does not depend on the users’ choices.
All downloaded messages are recorded in a news database after the extraction 
engine has stripped the HTML code in which the message is embedded (header 
and footer parts as well as HTML tags, pictures, and advertisements). 

Fig. 1. Overview of the SemanticEdge Publication Monitoring System



The spider developed by SemanticEdge is configured by providing a set of 
patterns which all URLs that are to be downloaded have to match. Typically, 
online issues of newspapers have a fairly fixed site structure and only vary the 
dates and story numbers in the URLs daily. Depending on the difficulty of the 
site structure, configuring the spider such that all current news stories but no 
advertisements, archives, or documents that do not directly belong to the news- 
paper are downloaded, requires between one and four hours. 
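
As an illustration of pattern-based spider configuration, the following minimal sketch filters candidate URLs with regular expressions. The host names, the URL layout, and the helper `should_download` are hypothetical; the configuration format actually used by the SemanticEdge spider is not described here.

```python
import re

# Hypothetical patterns for one newspaper: story URLs carry a date and a
# story number, while archives and advertisements do not match.
STORY_PATTERNS = [
    re.compile(r"^https?://news\.example\.com/\d{4}/\d{2}/\d{2}/story\d+\.html$"),
]


def should_download(url: str) -> bool:
    """Return True if the URL matches at least one configured pattern."""
    return any(p.match(url) for p in STORY_PATTERNS)


candidates = [
    "http://news.example.com/2001/11/25/story17.html",
    "http://news.example.com/archive/1999/index.html",
    "http://ads.example.com/banner?id=3",
]
to_fetch = [u for u in candidates if should_download(u)]
```

Only the dated story URL passes the filter; the archive page and the advertisement are skipped.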

The text classifier, named entity recognizer, and emotional analyzer operate on
this database. The text classifier categorizes all news stories and newsgroup 
postings whereas the emotional analyzer is only used for newsgroup postings; it 
classifies the emotional state that a message was likely to be written in. For each 
client company, a customized press clipping is generated, including summariza- 
tion and visualization functionality. 

The press clipping consists of a set of dynamically generated web pages that a 
user can view in a browser after providing a password. The system visualizes the 
number of publications by source, by subject, and by referred company. For each 
entry, an emotional score between zero (very negative) and one (very positive) 
is visualized as a red or green bar, indicating the attitude of the article (Fig. 2), 
or the set of summarized articles. Figure 2 shows the list of all articles relevant 
to a client. Figure 3 shows the summary mode in which the system summarizes 
all articles either from one news source, or about one company, or related to one 
business sector per line. The average positive or negative attitude of the articles 
summarized in one line is visualized by a red or green bar. 

Several diagrams visualize the frequency of referrals to business sectors or 
individual companies and the average expressed attitude (Figure 4). 



3 Intelligent Document Analysis 



Document analysis consists of information extraction (including recognition of 
named entities), subject classification, and emotional state analysis. 

Fig. 2. Press clipping for client company: message overview

Fig. 3. Press clipping for client company: company summary

Fig. 4. Top: frequency of messages related to business sectors. Bottom: expressed emotional
attitude toward companies

3.1 Information Extraction

Two main paradigms of information extraction agents that can be trained from
hand-labeled documents exist: algorithms that learn extraction rules (e.g., [8,14,6])
and statistical approaches such as Markov models [25,17], partially hidden
Markov models [21,23], and conditional random fields [15].

Rule-based information extraction algorithms appear to be particularly suited
to extract text from pages with a very strict structure and little variability 
between documents. In order to learn how to extract the text body from the 
HTML page of a Yahoo! message board, the proprietary rule learner that we use 
needs only one example in order to identify where, in the document structure, 
the information to be extracted is located. We can then extract text bodies from 
other messages with equal HTML structure with an accuracy of 100%. 
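
The idea of locating a field by its position in the document structure can be sketched as follows. This is not the proprietary rule learner mentioned above; it is a simplified stand-in that assumes a rule is just the tag path of the labeled text node, and the function names `learn_rule` and `apply_rule` as well as the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack, self.nodes = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.nodes.append(("/".join(self.stack), data.strip()))


def text_nodes(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rule(example_html, labeled_text):
    """Return the tag path of the text node labeled in a single example."""
    for path, text in text_nodes(example_html):
        if text == labeled_text:
            return path
    raise ValueError("labeled text not found in example")


def apply_rule(path, html):
    """Extract all text nodes that sit at the learned tag path."""
    return [text for p, text in text_nodes(html) if p == path]


example_page = ("<html><body><div><b>Board</b></div>"
                "<table><tr><td>Great quarter!</td></tr></table></body></html>")
other_page = ("<html><body><div><b>Board</b></div>"
              "<table><tr><td>Sell now.</td></tr></table></body></html>")

rule = learn_rule(example_page, "Great quarter!")
extracted = apply_rule(rule, other_page)
```

A single labeled example suffices here because both pages share the same structure; pages whose layout varies would need a more tolerant rule representation.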

For many other information extraction tasks, such as recognizing company
names or stock recommendations, rule-based learners do not provide enough
robustness to deal with the high variability of natural language. Hidden Markov
models (HMMs) (see [20] for an introduction) are a very robust statistical
method for analysis of temporal data. An HMM consists of finitely many states
$\{S_1, \ldots, S_N\}$ with probabilities $\pi_i = P(q_1 = S_i)$, the probability of starting
in state $S_i$, and $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, the probability of a transition
from state $S_i$ to $S_j$. Each state is characterized by a probability distribution
$b_i(O_t) = P(O_t \mid q_t = S_i)$ over observations. In the information extraction context,
an observation is typically a token. The information items to be extracted
correspond to the $n$ target states of the HMM. Background tokens without label
are emitted in all HMM states which are not one of the target states.

HMM parameters can be learned from data using the Baum-Welch algorithm.
When the HMM parameters are given, the model can be used to extract
information from a new document. Firstly, the document has to be transformed
into a sequence of tokens; for each token, several attributes are determined,
including the word stem, part of speech, the HTML context, and attributes that
indicate whether the word contains letters or digits, starts with a capital letter,
and so on. Thus, the document is transformed into a sequence of
attribute vectors.
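
A minimal sketch of this transformation is given below. It covers only attributes that can be computed from the token string itself; word stem and part of speech are left out because they require a stemmer and a tagger. The function names and the attribute keys are hypothetical.

```python
import re


def token_attributes(token: str, html_context: str) -> dict:
    """Describe one token by simple surface attributes."""
    return {
        "lower": token.lower(),
        "html_context": html_context,
        "has_letter": any(c.isalpha() for c in token),
        "has_digit": any(c.isdigit() for c in token),
        "capitalized": token[:1].isupper(),
    }


def document_to_vectors(text: str, html_context: str = "body") -> list:
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [token_attributes(t, html_context) for t in tokens]


vectors = document_to_vectors("Acme rose 3% today.")
```

Each element of `vectors` is one attribute vector; the sequence of these vectors is what the HMM would observe in place of raw tokens.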

Secondly, the forward-backward algorithm [20] is used to determine, for each 
token, the most likely state of the HMM that it was emitted in. If, for a given 
token, the most likely state is one of the background states, then this token 
can be ignored. If the most likely state is one of the target states and thus 
corresponds to one of the items to be extracted, then the token is extracted and 
copied into the corresponding database field. 
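
The decoding step can be illustrated with a small forward-backward implementation. The two-state model below (state 0 as background, state 1 as a company-name target) and all of its probabilities are invented for illustration; the observation symbols are reduced to 0 for a lowercase word and 1 for a capitalized word.

```python
def forward_backward(pi, A, B, obs):
    """Return posteriors gamma[t][i] = P(q_t = S_i | O) for a discrete HMM."""
    n, T = len(pi), len(obs)
    # Forward pass: alpha[t][i] = P(O_1..O_t, q_t = S_i)
    alpha = [[0.0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t - 1][i] * A[i][j] for i in range(n))
    # Backward pass: beta[t][i] = P(O_{t+1}..O_T | q_t = S_i)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    # Combine and normalize per time step.
    gamma = []
    for t in range(T):
        unnorm = [alpha[t][i] * beta[t][i] for i in range(n)]
        z = sum(unnorm)
        gamma.append([u / z for u in unnorm])
    return gamma


# Hypothetical model parameters.
pi = [0.9, 0.1]                      # start probabilities
A = [[0.8, 0.2], [0.6, 0.4]]         # transition probabilities
B = [[0.9, 0.1], [0.1, 0.9]]         # emission probabilities per state

tokens = ["shares", "Acme", "fell"]
obs = [1 if t[:1].isupper() else 0 for t in tokens]

gamma = forward_backward(pi, A, B, obs)
labels = [max(range(len(pi)), key=lambda i: g[i]) for g in gamma]
extracted = [tok for tok, lab in zip(tokens, labels) if lab == 1]
```

Tokens whose most likely state is the target state are extracted; the rest are treated as background and ignored.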

In order to adapt the HMM parameters, a user first has to label the information
to be extracted manually in example documents. Such partially labeled
documents form the input to the learning algorithm which then generates the
HMM parameters. We use a variant of the Baum-Welch algorithm [23] to find
the model parameters which are most likely to produce the given documents and 
are consistent with the labels added by the user. 

Figure 5 shows the GUI of the SemanticEdge information extraction environment.
HMM-based and rule-based learners are plugged into the system.




Fig. 5. GUI of the information extraction engine 



For specialized information extraction tasks such as finding company names 
in news stories, specially tailored information extraction agents outperform more 
general approaches such as HMMs. For instance, most companies that are being 
reported about are listed at some stock exchange. To recognize these companies, 
we only need to maintain a dynamically growing database. 
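
A dictionary-based recognizer of this kind can be sketched in a few lines. The company names and the helper functions are hypothetical, and an in-memory set stands in for the database.

```python
import re

# Hypothetical, dynamically growing table of listed companies.
LISTED_COMPANIES = {"Acme Corp", "Globex"}


def add_company(name: str) -> None:
    """Register a newly listed company."""
    LISTED_COMPANIES.add(name)


def find_companies(text: str) -> list:
    """Return known company names that occur in the text as whole words."""
    found = []
    for name in sorted(LISTED_COMPANIES):
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append(name)
    return found


add_company("Initech")
mentions = find_companies("Initech and Acme Corp announced a merger.")
```

Exact matching keeps precision high but misses name variants, so a real table would also need to list common abbreviations of each name.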



3.2 Subject Classification 

For the subject classification step, we have defined a set of message subject 
categories (e.g., IPO announcement, ad hoc message) and a set of business sector
and markets categories. The resulting classifiers assign each message a set of 
relevant subjects and sectors. 

The classifier proceeds in several steps. First, a text is tokenized and the 
resulting tokens are mapped to their word stems. We then count, for each word 
stem and each example text, how often that word occurs in the text. We thus 
transform each text into a feature vector, treating a text as a bag of words. 
Finally, we weight each feature by the inverse frequency of the corresponding
word, which has generally been observed to increase the accuracy of the resulting
classifiers (e.g., [10,22]). This procedure maps each text to a point in a
high-dimensional space.
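
The feature construction can be sketched as follows. Two simplifications are assumed: lowercased word tokens stand in for word stems, and the inverse frequency weight is taken to be the common logarithmic inverse document frequency, which may differ from the exact weighting used in the system.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    # Lowercased word tokens stand in for word stems in this sketch.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts: list) -> list:
    """Map each text to a sparse vector of count * log(N / document frequency)."""
    counts = [Counter(tokenize(t)) for t in texts]
    n_docs = len(texts)
    # Document frequency: number of texts in which each word occurs.
    df = Counter(word for c in counts for word in c)
    return [
        {w: tf * math.log(n_docs / df[w]) for w, tf in c.items()}
        for c in counts
    ]


docs = ["IPO announced today", "merger announced", "IPO IPO priced"]
vectors = tfidf_vectors(docs)
```

Words that occur in few texts receive a large weight, while a word that occurs in every text would receive weight zero.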

The Support Vector Machine (SVM) [11] is then used to efficiently find a 
hyper-plane which separates positive from negative examples, such that the margin
between any example and the plane is maximized. For each category we thus
obtain a classifier which can then take a new text and map it to a negative or
positive value, measuring the document’s relevance for the category.
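
In standard textbook notation (the symbols $\mathbf{w}$, $b$, $\mathbf{x}_i$ and $y_i$ are introduced here and are not taken from the system description), the decision function and the maximum-margin problem for linearly separable data can be written as follows. Practical SVM implementations usually solve a soft-margin variant that tolerates some misclassified examples.

```latex
f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b,
\qquad
\min_{\mathbf{w},\, b} \; \tfrac{1}{2} \lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \, (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1
\quad \text{for all training examples } (\mathbf{x}_i, y_i),\; y_i \in \{-1, +1\}.
```

The sign of $f(\mathbf{x})$ gives the predicted class, and its magnitude serves as the relevance score that is later compared against a threshold.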

During the application phase, the support vector machine returns, for each 
category, a value of its decision function, that can range from large negative to 
large positive values. It is necessary to define a threshold value from which on 
a document is considered to belong to the corresponding category. There are 
several criteria by which this threshold can be set; perhaps the most popular is
the precision-recall breakeven point. The precision quantifies the probability of
a document really belonging to a class given that it is predicted to lie in that
class. Recall, on the other hand, quantifies how likely it is that a document really
belonging to a category is in fact predicted to be a member of that class by the
classifier. By lowering the threshold value of the decision function we can increase
recall and decrease precision, and vice versa. The point at which precision equals
recall is often used as a normalized measure of the performance of classification and
IR methods. Varying the threshold leads to precision and recall curves. Figure
6 shows the GUI of our SVM-based text categorization tool.
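
The effect of the threshold on precision and recall can be made concrete with a short sketch. The decision values and labels below are invented, and `breakeven_threshold` simply searches the observed scores for the point where precision and recall are closest; it is an illustration, not the procedure used in the system.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def breakeven_threshold(scores, labels):
    """Observed score at which |precision - recall| is smallest."""
    def gap(th):
        p, r = precision_recall(scores, labels, th)
        return abs(p - r)
    return min(sorted(set(scores)), key=gap)


# Hypothetical decision values and true labels (True = belongs to category).
scores = [2.1, 1.3, 0.4, -0.2, -0.9, -1.5]
labels = [True, True, False, True, False, False]

high = precision_recall(scores, labels, 1.3)    # strict threshold
low = precision_recall(scores, labels, -0.2)    # lenient threshold
breakeven = breakeven_threshold(scores, labels)
```

With these values, lowering the threshold from 1.3 to -0.2 raises recall while precision drops, and precision and recall coincide at the threshold 0.4.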







Fig. 6. GUI of the text classification engine 

It is also possible to define the accuracy (the probability of the classifier
making a correct prediction for a new document) as a performance measure. 
Unfortunately, many categories (such as IPO announcement) are so infrequent, 
that a classifier which in fact never predicts that a document does belong to 
this class can achieve an accuracy of as much as 99.9%. This renders the use of 
accuracy as a performance metric less suited than precision/recall curves. 
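
The following small calculation illustrates the point with invented numbers: one positive document among 1000 and a classifier that never predicts the category.

```python
# Hypothetical test set: 1 IPO announcement among 1000 documents.
labels = [True] + [False] * 999

# A degenerate classifier that never predicts the category.
predictions = [False] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p and y for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)
```

The accuracy is 99.9% even though recall is zero, which is why the precision and recall curves are the more informative measure for rare categories.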

For each category, we manually selected about 3000 examples; between 60
and 700 of these examples were positives. Figure 7 shows precision, recall, and
accuracy of some randomly selected classes over the threshold value. The curves 
are based on hold-out testing on 20% of the data. Note that, for many of these 
classes such as xxx, the prior ratio of positive examples is extremely small (such 
as 1.4%). Specialized categories, such as “initial public offering announcement” 
can be recognized almost without error; “fuzzy” concepts like “positive market
news” impose greater uncertainties. 

3.3 Emotional Classification 

In psychology, a space of universal, culturally independent base emotional states
has been identified according to the differential emotional theory (e.g., [5,9]);
ten clusters within this emotional space are generally considered base emotions. 
These are interest, happiness, surprise, sorrow, anger, disgust, contempt, shame, 
fear, guilt (Figure 4). 

While it is typically impossible to analyze the emotional state of the author 
of a sober newspaper article, authors of newsgroup postings often do not conceal their
emotions. Given a posting, we use an SVM to determine, for each of the ten 
emotional states, a score that rates the likelihood of that emotion for the author. 
We average the scores over all postings related to a company, or to a product, 
and visualize the result as in Figure 4. We can project emotional scores onto a 
“positive-negative” ray and visualize the resulting score as a red or green bar as 
in Figure 2. 
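
One possible way to aggregate the per-emotion scores is sketched below. The text does not specify how the projection onto the positive-negative ray is computed, so both the polarity assigned to each base emotion and the linear mapping to the interval [0, 1] are assumptions made for illustration, as are the sample scores.

```python
EMOTIONS = ["interest", "happiness", "surprise", "sorrow", "anger",
            "disgust", "contempt", "shame", "fear", "guilt"]

# Assumed polarity of each base emotion (+1 positive, -1 negative, 0 neutral).
POLARITY = {"interest": 1, "happiness": 1, "surprise": 0, "sorrow": -1,
            "anger": -1, "disgust": -1, "contempt": -1, "shame": -1,
            "fear": -1, "guilt": -1}


def average_scores(postings):
    """Average each emotion's score over all postings about one company."""
    return {e: sum(p[e] for p in postings) / len(postings) for e in EMOTIONS}


def attitude(avg):
    """Map averaged scores to [0, 1]: 0 is very negative, 1 is very positive."""
    pos = [avg[e] for e in EMOTIONS if POLARITY[e] > 0]
    neg = [avg[e] for e in EMOTIONS if POLARITY[e] < 0]
    raw = sum(pos) / len(pos) - sum(neg) / len(neg)
    return min(1.0, max(0.0, 0.5 + 0.5 * raw))


# Hypothetical classifier scores in [0, 1] for two postings about one company.
zero = dict.fromkeys(EMOTIONS, 0.0)
postings = [
    {**zero, "anger": 0.8},
    {**zero, "anger": 0.4, "happiness": 0.2},
]
avg = average_scores(postings)
score = attitude(avg)
```

A score below 0.5 would be drawn as a red bar and a score above 0.5 as a green bar.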

We manually classified postings to discussion boards into positive, negative, 
and neutral for each of the ten base emotional states. Emotional classification of 
messages turned out to be a fairly noisy process; the judgment on the emotional 
content of postings usually varies between individuals. Unfortunately, we found 
no positive examples for disgust but between 2 and 21 positive and between 16 
and 92 negative examples for the other states. 

Figure 8 shows precision and recall curves for those emotions for which we 
found most positive examples, based on 10-fold cross validation. As we expected, 
recognizing emotions seems to be a very difficult task; in particular, from the 
small samples available. Still the recognizer performs significantly better than 
random guessing. Emotional classifiers with a rather high threshold can often 
achieve reasonable precision values. Also, in many cases in which human and 
classifier disagree, it is not easy to tell whether the human or the classifier is wrong.

Fig. 7. Precision, recall, and accuracy for subject classification. First row: “US treasury”,
“mergers and acquisition”; second row: “positive” / “negative market and economy
news”; third row: “initial public offering announcement”, “currency and exchange
rates”.

Fig. 8. Precision, recall, and accuracy for emotional classification. First row: positive
versus negative, anger; second row: contempt, fear.



4 Conclusion 

We describe a system that monitors online news sources and discussion boards,
downloads the content regularly, extracts the document bodies, analyzes
messages by content and emotional state, and generates customer-specific press
clippings. A user of the system can specify his or her information needs by
entering a list of company names (e.g., the name of the user's own company and
relevant competitors) and selecting from a set of message types and business
sectors. Information extraction tasks are addressed by rule induction and
hidden Markov models; the
Support Vector Machine is used to learn classifiers from hand-labeled data. The 
customer-specific news stories are listed individually, as well as summarized by 
several criteria. Diagrams visualize how frequently business sectors or companies 
are cited over time. 

The resulting press clippings are generated in near real-time and fully au- 
tomatically. This tool enables companies to keep track of how they are being 
perceived in news groups and in the press. It is also inexpensive compared to 
press clipping agencies. On the downside, the system is certain to miss all news
stories that appear only in printed issues. Also, the classifier has a certain
inaccuracy, which carries the risk of missing relevant articles. Of course, this
risk is also present with press clipping agencies. Nearly all studied subject categories
can be recognized very reliably using support vector classifiers.



Abstract. In recent years, it has been shown that methods from Inductive
Logic Programming (ILP) are powerful enough to discover new
first-order knowledge from data, while employing a clausal representation
language that is relatively easy for humans to understand. Despite
these successes, it is generally acknowledged that there are issues that
present fundamental challenges for the current generation of systems.
Among these, two problems are particularly prominent: learning deep
clauses, i.e., clauses where a long chain of literals is needed to reach
certain variables, and learning wide clauses, i.e., clauses with a large
number of literals. In this paper we present a case study to show that
by building on positive results on acyclic conjunctive query evaluation 
in relational database theory, it is possible to construct ILP learning al- 
gorithms that are capable of discovering clauses of significantly greater 
depth and width. We give a detailed description of the class of clauses 
we consider, describe a greedy algorithm to work with these clauses, and 
show, on the popular ILP challenge problem of mutagenicity, how indeed 
our method can go beyond the depth and width barriers of current ILP 
systems. 



1 Introduction 

In recent years, it has been shown that methods from Inductive Logic Program- 
ming (ILP) [23,32] are powerful enough to discover new first-order knowledge 
from data, while employing a clausal representation language that is relatively 
easy for humans to understand. Despite these successes, it is generally acknowl- 
edged that there are issues that present fundamental challenges for the current 
generation of systems. Among these, two problems are particularly prominent: 
learning deep clauses, i.e., clauses where a long chain of literals is needed to reach 
certain variables, and learning wide clauses, i.e., clauses with a large number of 
interconnected literals.

In current ILP systems, these challenges are reflected in system parameters
that bound the depth and width of the clauses, respectively. Practical experience
in applications shows that tractable runtimes are achieved only when these
parameters are set to small values; in fact it is not uncommon to limit
the depth of clauses to two or three, and their width to four or five. In a recent
study, Giordana and Saitta [10] have shown, based on empirical simulations, that
indeed there seems to be a fundamental limit for current ILP systems, and that
this limit might in large part be due to the extreme growth of matching costs,
i.e., the cost of determining if a clause covers a given example. Thus, if matching
costs could be reduced, it should be possible to learn clauses of significantly
greater depth and width than currently achievable.

In this paper, we present an ILP algorithm and a case study which provide 
evidence that indeed this seems to be the case. In our algorithm, we build on 
positive complexity results on conjunctive query evaluation in the area of rela- 
tional database theory, and employ the class of acyclic conjunctive queries where 
the matching problem is known to be tractable. In the domain of mutagenicity, 
we show that using our algorithm it is indeed possible to discover structural 
relationships that must be expressed in clauses that have significantly greater
depth and width than those currently learnable. In fact, the additional predic- 
tive power gained by these deep and wide structures has allowed us to reach a 
predictive accuracy comparable to the one attained in previous studies, without 
using the additional numerical information available in these experiments. 

The paper is organized as follows. In Section 2, we first briefly introduce the 
learning problem that is usually considered in ILP. In Section 3, we present a 
more detailed introduction into the matching problem and discuss the state of 
the art in related work on the issue. In Section 4, we then formally define the class 
of acyclic clauses that is used in this work, and describe its properties. Section 5 
discusses our greedy algorithm which uses this class of clauses to perform ILP 
learning. Section 6 contains our case study in the domain of mutagenesis, and 
Section 7 concludes. 

2 The ILP Learning Problem 

The ILP learning problem is often simply defined as follows (see, e.g., [32]). 

Definition 1 (ILP prediction learning problem). Given

— a vocabulary consisting of finite sets of function and predicate symbols,

— a background knowledge language $L_b$, an example language $L_e$, and a
hypothesis language $L_h$, all over the given vocabulary,

— background knowledge $B$ expressed in $L_b$, and

— sets $E^+$ and $E^-$ of positive and negative examples expressed in $L_e$

such that $B$ is consistent with $E^+$ and $E^-$
($B \cup E^+ \cup E^- \not\models \Box$), find a learning hypothesis
$H \in L_h$ such that

(i) $H$ is complete, i.e., together with $B$ entails the positive examples
($H \cup B \models E^+$),

(ii) and $H$ is correct, i.e., is consistent with the negative examples
($H \cup B \cup E^+ \cup E^- \not\models \Box$).

This problem is called the prediction learning problem because the learning 
hypothesis must be such that together with B it correctly predicts (derives, 
covers) the positive examples, and does not predict (derive, cover) the negation 
of any negative example as true (otherwise the hypothesis would be inconsistent 
with the negative examples). For instance, if flies(tweety) is a positive example
and ¬flies(bello) a negative one then flies(bello) must not be predicted.¹

In order to decide conditions (i) and (ii) in the above definition, one has to
decide for a single $e \in E^+ \cup E^-$ whether $H \cup B \models e$. This
decision problem is called the matching or membership problem. We note that in
the general problem setting defined above, the membership problem is not
decidable. Therefore, in most cases implication is replaced by clause
subsumption, defined as follows. Let $C_1$ and $C_2$ be first-order clauses. We
say that $C_1$ subsumes $C_2$, denoted by $C_1 \leq C_2$, if there is a
substitution $\theta$ (a mapping of $C_1$'s variables to $C_2$'s terms) such
that $C_1\theta \subseteq C_2$ (for more details see, e.g., [25]).
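To make the subsumption test concrete, here is a minimal backtracking sketch. It is an illustration only: the encoding of a literal as a `(predicate, arguments)` pair, the convention that variables start with an upper-case letter, and the function names are assumptions of this example, not notation from the text. Signs of literals are not modeled; they could be folded into the predicate name.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()


def match(lit1, lit2, theta):
    """Extend theta so that lit1 under theta equals lit2; return the new theta or None."""
    (p1, args1), (p2, args2) = lit1, lit2
    if p1 != p2 or len(args1) != len(args2):
        return None
    theta = dict(theta)
    for a, b in zip(args1, args2):
        if is_var(a):
            if theta.setdefault(a, b) != b:   # a is already bound to another term
                return None
        elif a != b:                          # constants must coincide
            return None
    return theta


def subsumes(c1, c2, theta=None):
    """Return a substitution theta with c1*theta a subset of c2, or None if none exists."""
    theta = {} if theta is None else theta
    if not c1:
        return theta
    first, rest = c1[0], c1[1:]
    for lit in c2:                            # try every possible image of `first`
        extended = match(first, lit, theta)
        if extended is not None:
            result = subsumes(rest, c2, extended)
            if result is not None:
                return result
    return None


c1 = [("r", ("X", "Y")), ("r", ("Y", "Z"))]
chain = [("r", ("a", "b")), ("r", ("b", "c"))]
no_chain = [("r", ("a", "b")), ("r", ("c", "d"))]
```

The search tries every literal of `c2` as the image of each literal of `c1`, so its running time can grow exponentially with the length of `c1`; this is the matching cost the rest of the paper is concerned with.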



3 The Matching Problem: State of the Art 

One of the reasons why the width and depth of the clauses in the hypothesis 
language are usually bounded by a small constant is that even in the strongly 
restricted ILP problem settings the membership problem is still NP-complete. 
For instance, consider the ILP prediction learning problem, where (non-constant) 
function symbols are not allowed in the vocabulary, the background knowledge is 
an extensional database (i.e., it consists of ground atoms), examples are ground 
atoms, and the hypothesis language is a subset of the set of definite non-recursive 
first-order Horn clauses, or in other words, it is a subset of the set of conjunctive 
queries [1,30]. This is one of the problem settings most frequently considered in
real-world ILP applications. Although in this setting, the membership problem,
i.e., the problem of deciding whether a conjunctive query implies a ground atom 
with respect to an extensional database, and implication between conjunctive 
queries are both decidable, they are still NP-complete [6]. In the ILP commu- 
nity, both of these problems are viewed as instances of the clause subsumption 
problem because implication is equivalent to clause subsumption in the prob- 
lem setting considered (see, e.g., [11]). These decision problems play a central 
role e.g. in top-down ILP approaches (see, e.g., [25] for an overview), where 
the algorithm starts with an overly general clause, for instance with the empty 
clause, and specializes it step by step until a clause is found that satisfies the 
requirements defined by the user. 

¹ Strictly speaking, the above setting only refers to the training error of a hypothesis,
while ILP systems actually seek to minimize the true error on future examples.

As mentioned above, subsumption between first-order clauses is one of the
most important operators used in different ILP methods. Since the clause sub- 
sumption problem is known to be NP-complete, different approaches can be 
found in the corresponding literature that try to solve it in polynomial time. 
Among these, we refer to the technique of identifying tractable subclasses of 
first-order clauses (see, e.g., [12,18,26]), to the earlier mentioned phase transi- 
tions in matching [10], and to stochastic matching [27]. 

In general, the clause subsumption problem can be considered as a homomorphism
problem between the relational structures that correspond to the clauses,
as one has to find a function between the universes of the structures that pre- 
serves the relations (see, e.g., [16]). Homomorphisms between relational struc- 
tures appear in the query evaluation problems in relational database theory or 
in the constraint-satisfaction problem in artificial intelligence (see, e.g., [19]). In 
particular, from the point of view of computational complexity, the query evalua- 
tion problem for the above mentioned class of conjunctive queries is well-studied. 
Research in this field goes back to the seminal paper by Chandra and Merlin [6] 
in the late seventies, who showed that the problem of evaluating a conjunctive 
query with respect to a relational database is NP-complete. In [33], Yannakakis 
has shown that query evaluation becomes computationally tractable if the set of 
literals in the query forms an acyclic hypergraph. This class of conjunctive queries 
is called acyclic conjunctive queries. In [13], Gottlob, Leone, and Scarcello have 
shown that acyclic conjunctive queries are LOGCFL-complete. The relevance of 
this result, besides providing the precise complexity of acyclic conjunctive query 
evaluation, is that acyclic conjunctive query evaluation is highly parallelizable 
due to the nature of LOGCFL. The positive complexity result of Yannakakis
was then extended by Chekuri and Rajaraman [7] to cyclic queries of bounded
query-width. 

Despite the fact that the class of conjunctive queries is one of the most
frequently considered hypothesis languages in ILP, and that acyclic conjunctive
queries form a practically relevant class of database queries, to our knowledge
only the recent paper [15] by Hirata has so far been concerned with acyclic
conjunctive queries from the point of view of learnability.² In that paper,
Hirata has shown that, under widely believed complexity assumptions, a single
acyclic conjunctive query is not polynomially predictable, and hence, it is not
polynomially PAC-learnable [31]. This means that even though the membership
problem for acyclic clauses is decidable in polynomial time, under worst-case
assumptions the problem of learning these clauses is hard, so that practical
learning algorithms, such as the one presented in Section 5, must resort to
heuristic methods.



² The notion of acyclicity appears in the literature of ILP (see, e.g., [2]), but is different
from the one considered in this paper.

4 Acyclic Conjunctive Queries

In this section we give the necessary notions related to acyclic conjunctive queries 
considered in this work. For a detailed introduction to acyclic conjunctive queries 
the reader is referred to e.g. [1,30] or to the long version of [13]. 

For the rest of this paper, we assume that the vocabulary in Definition 1 
consists of a set of constant symbols, a distinguished predicate symbol called the 
target predicate, and a set of predicates called the background predicates. Thus, 
(non-constant) function symbols are not included in the vocabulary. Examples 
are ground atoms of the target predicate, and the background knowledge is an 
extensional database consisting of ground atoms of the background predicates. 
Furthermore, we assume that hypotheses in $L_h$ are definite non-recursive
first-order clauses, or in the terminology of relational database theory,
conjunctive queries of the form

$L_0 \leftarrow L_1, \ldots, L_l$

where $L_0$ is a target atom, and $L_i$ is a background atom for
$i = 1, \ldots, l$. In what follows, by Boolean conjunctive queries we mean
first-order goal clauses of the form

$\leftarrow L_1, \ldots, L_l$

where the $L_i$'s are all background atoms.

In order to define a special class of conjunctive queries, called acyclic
conjunctive queries, we first need the notion of acyclic hypergraphs. A
hypergraph (or set-system) $H = (V, E)$ consists of a finite set $V$ of
vertices, and a family $E$ of subsets of $V$ called hyperedges. A hypergraph is
α-acyclic [9], or simply acyclic, if one can remove all of its vertices and
edges by repeatedly deleting either a hyperedge that is empty or is contained
in another hyperedge, or a vertex contained in at most one hyperedge [14,34].
Note that acyclicity as defined here is not a hereditary property, in contrast
to, e.g., the standard notion of acyclicity in ordinary undirected graphs, as it
may happen that an acyclic hypergraph has a cyclic subhypergraph. For example,
consider the hypergraph $H = (\{a, b, c\}, \{e_1, e_2, e_3, e_4\})$ with
$e_1 = \{a, b\}$, $e_2 = \{b, c\}$, $e_3 = \{a, c\}$, and $e_4 = \{a, b, c\}$.
This is an acyclic hypergraph, as one can remove step by step first the
hyperedges $e_1, e_2, e_3$ (as they are subsets of $e_4$), then the three
vertices, and finally, the empty hypergraph is obtained by removing the empty
hyperedge that remained from $e_4$. On the other hand, the hypergraph
$H' = (\{a, b, c\}, \{e_1, e_2, e_3\})$, which is a subhypergraph of $H$, is
cyclic, as there is no vertex or edge that could be deleted by the above
definition. In [9], other degrees of acyclicity are also considered, and it is
shown that among them, α-acyclic hypergraphs form the largest class properly
containing the other classes.
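The deletion procedure in the definition above can be run mechanically. The following sketch applies the two deletion rules until neither applies; the function name and the representation of hyperedges as Python sets are assumptions of this illustration. It is checked on the two hypergraphs $H$ and $H'$ from the example.

```python
def is_alpha_acyclic(vertices, hyperedges):
    """Apply the two deletion rules until stuck; acyclic iff everything was deleted."""
    edges = [set(e) for e in hyperedges]
    verts = set(vertices)
    changed = True
    while changed:
        changed = False
        # Rule 1: delete a hyperedge that is empty or contained in another hyperedge.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
        if changed:
            continue
        # Rule 2: delete a vertex contained in at most one hyperedge.
        for v in list(verts):
            if sum(v in e for e in edges) <= 1:
                verts.discard(v)
                for e in edges:
                    e.discard(v)
                changed = True
                break
    return not verts and not edges


e1, e2, e3, e4 = {"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}
h_is_acyclic = is_alpha_acyclic({"a", "b", "c"}, [e1, e2, e3, e4])
h_prime_is_acyclic = is_alpha_acyclic({"a", "b", "c"}, [e1, e2, e3])
```

On $H$ the three small hyperedges are deleted first because each is contained in $e_4$; on $H'$ no rule applies at all, which is exactly the non-hereditary behaviour noted in the text.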

Using the above notion of acyclicity, we are now ready to define the class of
acyclic conjunctive queries. Let $Q$ be a conjunctive query and $L$ be a literal
of $Q$. We denote by Var($Q$) (resp. Var($L$)) the set of variables occurring in
$Q$ (resp. $L$). We say that $Q$ is acyclic if the hypergraph $H(Q) = (V, E)$
with $V = \mathrm{Var}(Q)$ and
$E = \{\mathrm{Var}(L) : L \text{ is a literal in } Q\}$ is acyclic. For
instance, from the conjunctive queries

$P(X, Y, X) \leftarrow R(X, Y), R(Y, Z), R(Z, X)$

$P(X, Y, Z) \leftarrow R(X, Y), R(Y, Z), R(Z, X)$

the first one is cyclic, while the second one is acyclic, since its head
contains all three variables and its hyperedge therefore contains those of all
body literals.

In [3] it is shown that the class of acyclic conjunctive queries is identical to
the class of conjunctive queries that can be represented by join forests [4]. Given
a conjunctive query $Q$, the join forest $JF(Q)$ representing $Q$ is an ordinary
undirected forest such that its vertices are the literals of $Q$, and for each
variable $x \in \mathrm{Var}(Q)$ it holds that the subgraph of $JF(Q)$ consisting of the
vertices that contain $x$ is connected (i.e., it is a tree).

Now we show how to use join forests for efficient acyclic query evaluation.
Let $E$ be a set of ground target atoms and $B$ be the background knowledge
as defined at the beginning of this section, and let $Q$ be an acyclic
conjunctive query with join forest $JF(Q)$. In order to find the subset
$E' \subseteq E$ implied by $Q$ with respect to $B$, we can apply the following
method. Let $T_0, T_1, \ldots, T_k$ ($k \geq 0$) denote the connected components
of $JF(Q)$, where $T_0$ denotes the tree containing the head of $Q$, and let
$Q_i \subseteq Q$ denote the query represented by $T_i$ for $i = 0, \ldots, k$.
The definition of the $Q_i$'s implies that they form a partition of the set of
literals of $Q$ such that literals belonging to different blocks do not share
common variables. Therefore, the subqueries $Q_0, \ldots, Q_k$ can be evaluated
separately; if there is an $i$, $1 \leq i \leq k$, such that the Boolean
conjunctive query $Q_i$ is false with respect to $B$ then $Q$ implies none of
the elements of $E$ with respect to $B$, otherwise $Q$ and $Q_0$ imply the same
subset of $E$ with respect to $B$. By definition, $Q_0$ implies an atom
$e \in E$ if there is a substitution mapping the head of $Q_0$ to $e$ and the
atoms in its body into $B$, and $Q_i$ ($1 \leq i \leq k$) is true with respect
to $B$ if there is a substitution mapping $Q_i$'s atoms into $B$. That is, using
algorithm Evaluate given below, $Q$ implies $E'$ with respect to $B$ if and only if

$$\left(E' \subseteq \mathrm{Evaluate}(B \cup E, T_0)\right) \wedge \bigwedge_{i=1}^{k} \left(\mathrm{Evaluate}(B, T_i) \neq \emptyset\right).$$
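The bottom-up semijoin evaluation performed by algorithm Evaluate can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: atoms are encoded as `(predicate, arguments)` pairs, variables are strings starting with an upper-case letter, a join tree is a nested `(literal, children)` pair rooted at the head, and the tiny database with compounds `m1` and `m2` is made up.

```python
def is_var(t):
    """Assumed convention: variables are strings starting with an upper-case letter."""
    return isinstance(t, str) and t[:1].isupper()


def matches(literal, database):
    """All substitutions (dicts) mapping `literal` onto a ground atom of the database."""
    pred, args = literal
    result = []
    for fact_pred, fact_args in database:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        theta = {}
        if all(theta.setdefault(a, b) == b if is_var(a) else a == b
               for a, b in zip(args, fact_args)):
            result.append(theta)
    return result


def evaluate(database, tree):
    """Matches of the root literal that extend to a match of the whole join tree."""
    root, children = tree
    r = matches(root, database)
    for child in children:
        s = evaluate(database, child)
        shared = [v for v in child[0][1] if is_var(v) and v in root[1]]
        keys = {tuple(theta[v] for v in shared) for theta in s}
        # Natural semijoin: keep the root matches that agree with some child match.
        r = [theta for theta in r if tuple(theta[v] for v in shared) in keys]
    return r


def instantiate(literal, theta):
    """Apply a substitution to a literal."""
    pred, args = literal
    return (pred, tuple(theta.get(a, a) for a in args))


background = {
    ("atm", ("m1", "a1", "c")), ("atm", ("m1", "a2", "o")), ("bond", ("m1", "a1", "a2")),
    ("atm", ("m2", "a3", "c")), ("atm", ("m2", "a4", "c")), ("bond", ("m2", "a3", "a4")),
}
examples = {("active", ("m1",)), ("active", ("m2",))}

# Join tree of: active(M) <- bond(M, A, B), atm(M, A, c), atm(M, B, o)
tree = (("active", ("M",)),
        [(("bond", ("M", "A", "B")),
          [(("atm", ("M", "A", "c")), []),
           (("atm", ("M", "B", "o")), [])])])

covered = {instantiate(tree[0], th) for th in evaluate(background | examples, tree)}
```

Each literal is matched against the database once and each tree edge costs one semijoin, so the work grows polynomially with the size of the query and the database rather than with the number of candidate substitutions for the whole clause.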

algorithm Evaluate

input: extensional database $D$ and join tree $T$ with root labeled by $n_0$
output: $\{n_0\theta : \theta$ is a substitution mapping the nodes of $T$ into $D\}$

let $R = \{n_0\theta : \theta$ is a substitution mapping $n_0$ into $D\}$
let the children of $n_0$ be labeled by $n_1, \ldots, n_k$ ($k \geq 0$)
for i = 1 to k
    S = Evaluate(D, T_i)    // T_i is the subtree of T rooted at n_i
    R = the natural semijoin of R and S wrt. n_0 and n_i
endfor
return R



It remains to discuss the problem of how to compute a join forest for an
acyclic conjunctive query. Using maximal weight spanning forests of ordinary
graphs, Bernstein and Goodman [4] give the following method for this problem.
Let $Q$ be an acyclic conjunctive query, and let $G(Q) = (V, E, w)$ be a weighted
graph with vertex set $V = \{L : L \text{ is a literal of } Q\}$, edge set
$E = \{(u, v) : \mathrm{Var}(u) \cap \mathrm{Var}(v) \neq \emptyset\}$, and with
weight function $w : E \to \mathbb{N}$ defined by

$w : (u, v) \mapsto |\mathrm{Var}(u) \cap \mathrm{Var}(v)|$.

Let $MSF(Q)$ be a maximal weight spanning forest of $G(Q)$. Note that maximal
weight spanning forests can be computed in polynomial time (see, e.g., [8]). It
holds that if $Q$ is acyclic then $MSF(Q)$ is a join forest representing $Q$. In
addition, given a maximal weight spanning forest $MSF(Q)$ of a conjunctive query
$Q$, instead of using the method given in the definition of acyclic hypergraphs, in
order to decide whether $Q$ is acyclic, one can check whether the equation

$$\sum_{(u,v) \in MSF(Q)} w(u, v) = \sum_{x \in \mathrm{Var}(Q)} (\mathrm{Class}(x) - 1) \qquad (1)$$

holds, where Class($x$) denotes the number of literals in $Q$ that contain $x$ (see
also [4]).³
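The spanning-forest construction and the test in equation (1) can be tried out on the two example queries of this section. The sketch below uses Kruskal's algorithm with edges sorted by decreasing weight; the literal encoding (upper-case strings are variables) and all function names are assumptions of this illustration.

```python
from itertools import combinations


def variables(literal):
    """Variables of a literal (pred, args); an upper-case initial marks a variable."""
    return {t for t in literal[1] if t[:1].isupper()}


def max_weight_spanning_forest(literals):
    """Kruskal on G(Q): literals sharing variables are joined, weight = number shared."""
    edges = []
    for i, j in combinations(range(len(literals)), 2):
        w = len(variables(literals[i]) & variables(literals[j]))
        if w > 0:
            edges.append((w, i, j))
    parent = list(range(len(literals)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    forest = []
    for w, i, j in sorted(edges, reverse=True):   # heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((w, i, j))
    return forest


def is_acyclic_query(literals):
    """Equation (1): forest weight equals the sum of Class(x) - 1 over all variables."""
    forest_weight = sum(w for w, _, _ in max_weight_spanning_forest(literals))
    all_vars = set().union(*(variables(l) for l in literals))
    class_sum = sum(sum(x in variables(l) for l in literals) - 1 for x in all_vars)
    return forest_weight == class_sum


body = [("r", ("X", "Y")), ("r", ("Y", "Z")), ("r", ("Z", "X"))]
q_cyclic = [("p", ("X", "Y", "X"))] + body    # head lacks Z
q_acyclic = [("p", ("X", "Y", "Z"))] + body   # head contains all variables
```

For the second query the forest consists of the three weight-2 edges from the head to the body literals, giving 6 on both sides of (1); for the first query the forest weight is 4 while the right-hand side is 5.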

5 A Greedy Algorithm 

The goal of our learning algorithm is to discover sets of acyclic clauses that 
together are correct and complete. From the results of [16] on learning multi- 
ple clauses it follows that this problem is NP-hard, so we resort to a greedy 
sequential covering algorithm (see, e.g., [21]), as is commonplace in ILP. Our
sequential covering algorithm takes as input the background knowledge B and 
the set E of examples, calls the subroutine SingleClause for finding an acyclic 
conjunctive query Q, then updates E by removing the positive examples implied 
by Q with respect to B, and starts the process again until no new rule is found by 
the subroutine. It finally prints as output the set of acyclic conjunctive queries 
discovered. 
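The covering loop just described can be sketched as follows. The clause learner and the coverage test are passed in as functions, so the sketch says nothing about how they work; the interval "clauses" in the example are a toy stand-in, and the guard against a rule that covers nothing is an addition of this illustration rather than part of the description above.

```python
def sequential_covering(background, positives, negatives, single_clause, covers):
    """Greedy covering loop: learn one clause, drop the positives it covers, repeat."""
    remaining = set(positives)
    rules = []
    while remaining:
        rule = single_clause(background, remaining, negatives)
        if rule is None:           # the subroutine found no acceptable clause
            break
        covered = {e for e in remaining if covers(rule, background, e)}
        if not covered:            # added guard: avoid looping on a useless rule
            break
        rules.append(rule)
        remaining -= covered
    return rules


# Toy stand-ins: examples are integers, "clauses" are closed integer intervals.
candidates = [(0, 3), (4, 6), (7, 9)]


def covers(rule, background, example):
    lo, hi = rule
    return lo <= example <= hi


def single_clause(background, positives, negatives):
    """Return the first candidate covering some positive and no negative example."""
    for rule in candidates:
        if any(covers(rule, background, e) for e in positives) and \
           not any(covers(rule, background, e) for e in negatives):
            return rule
    return None


rules = sequential_covering(None, {1, 2, 3, 8, 9}, {5, 6}, single_clause, covers)
```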

Now we turn to the problem of how to find a single acyclic conjunctive
query.⁴ In order to give the details on the subroutine SingleClause called by
the algorithm, we first need the notion of refinement operators (see Chapter 17
in [25] for an overview). Recall that we consider the special ILP problem
setting defined at the beginning of the previous section. Fix the vocabulary and
let $L$ denote the set of acyclic conjunctive queries over the vocabulary. A
downward refinement operator is a function $\rho : L \to 2^L$ such that
$Q_1 \leq Q_2$ for every $Q_1 \in L$ and $Q_2 \in \rho(Q_1)$.

³ The reason why Class($x$) − 1 is used in (1) is that the number of edges in a tree is
equal to its number of vertices minus 1.

⁴ We note that the general problem of finding a single consistent and complete (not
necessarily acyclic) conjunctive query is a PSPACE-hard problem [17] and it is an
open problem whether it belongs to PSPACE (see also [16]). On the other hand, it
is not known whether it remains PSPACE-hard for the class of acyclic conjunctive
queries considered in this work, or for the other three classes corresponding to
β-, γ-, and Berge-acyclicity discussed in [9].



algorithm SingleClause

input: background knowledge B and set E = E⁺ ∪ E⁻ of examples
output: either ∅ or an acyclic conjunctive query Best satisfying
        |Covers(Best, B, E⁺)| / |E⁺| > P_cov and Accuracy(Best, B, E) > P_acc

Beam = {P(x_1, ..., x_n) ←}    // P denotes the target predicate
Best = ∅
LastChange = 0
repeat
    NewBeam = ∅
    forall C ∈ Beam
        forall C' ∈ ρ(C)
            if |Covers(C', B, E⁺)| / |E⁺| > P_cov then
                if Accuracy(C', B, E) > max(P_acc, Accuracy(Best, B, E)) then
                    Best = C'
                    LastChange = 0
                endif
                update NewBeam by C'
            endif
        endfor
    endfor
    LastChange = LastChange + 1
    Beam = NewBeam
until Beam = ∅ or LastChange > P_change
return Best



Algorithm SingleClause applies beam search for finding a single acyclic
conjunctive query. Its input is $B$ and the current set $E$ of examples. It
returns the acyclic conjunctive query Best that covers a sufficiently large
(defined by $P_{cov}$) part of the positive examples and has accuracy at least
$P_{acc}$, where $P_{cov}$ and $P_{acc}$ are user defined parameters. If it has
not found such an acyclic conjunctive query, then it returns the empty set. In
each iteration of the outer (repeat) loop, the algorithm computes the
refinements for each acyclic conjunctive query in the beam stack, and if a
refinement is found that is better than the best one discovered so far then it
will be the new best candidate. The beam stack is updated according to the
rules' quality measured by Accuracy. Finally we note that the outer loop is
terminated if no candidate refinement has been generated or in the last
$P_{change}$ iterations of the outer loop the best rule has not been changed,
where $P_{change}$ is a user defined parameter.
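The control flow of SingleClause can be sketched generically in Python. The sketch is an illustration under assumptions: the refinement operator, coverage and accuracy are supplied as functions, the beam is truncated to the `beam_width` most accurate candidates (one plausible reading of "updated according to the rules' quality"), and `None` stands in for the empty result. The integer "clauses" in the example are a toy and not a real refinement operator.

```python
def single_clause(start, refine, coverage, accuracy,
                  p_cov, p_acc, p_change, beam_width):
    """Beam search over refinements; returns None if no candidate qualifies."""
    beam = [start]
    best, best_acc = None, p_acc     # best_acc plays the role of max(P_acc, Accuracy(Best))
    last_change = 0
    while True:
        new_beam = []
        for clause in beam:
            for cand in refine(clause):
                if coverage(cand) > p_cov:
                    acc = accuracy(cand)
                    if acc > best_acc:
                        best, best_acc = cand, acc
                        last_change = 0
                    new_beam.append((acc, cand))
        last_change += 1
        # Keep only the `beam_width` most accurate candidates (assumed update rule).
        new_beam.sort(key=lambda pair: pair[0], reverse=True)
        beam = [cand for _, cand in new_beam[:beam_width]]
        if not beam or last_change > p_change:
            return best


# Toy search space: candidates are integers, refinement adds 1 or 2.
def toy_refine(n):
    return [n + 1, n + 2] if n < 6 else []


def toy_accuracy(n):
    return 1 - abs(n - 4) / 10       # peaks at n = 4


found = single_clause(0, toy_refine, lambda n: 1.0, toy_accuracy,
                      p_cov=0.1, p_acc=0.5, p_change=2, beam_width=2)
nothing = single_clause(0, toy_refine, lambda n: 1.0, toy_accuracy,
                        p_cov=0.1, p_acc=2.0, p_change=2, beam_width=2)
```

In the first call the search reaches the candidate 4 in the second iteration and stops two iterations later because nothing better appears; in the second call the accuracy threshold is unattainable, so the result is `None`.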

6 Case Study: Mutagenesis 

Chemical mutagens are natural or artificial compounds that are capable of
causing permanent transmissible changes in DNA. Such changes or mutations may
involve small gene segments as well as whole chromosomes. Carcinogenic
compounds are chemical mutagens that harmfully alter the DNA's structure or
sequence, causing cancer in mammals. A huge amount of research in the field of
organic chemistry has focused on identifying carcinogenic chemical compounds.

The first study on using ILP for predicting mutagenicity in nitroaromatic
compounds, along with a Prolog database, was published in [29]. This
database consists of two sets of nitroaromatic compounds, from which we have
used the regression-friendly one containing 188 compounds. Depending on the
value of log mutagenicity, the compounds were split into two disjoint sets (active,
consisting of 125 compounds, and inactive, consisting of 63 compounds). The basic
structure of the compounds is represented by the background predicates ‘atm’ and
‘bond’ of the form

atm(Compound_Id, Atom_Id, Element, Type, Charge),

bond(Compound_Id, Atom1_Id, Atom2_Id, BondType),

respectively. Thus, the background knowledge $B$ can be considered as a
labeled directed graph. In order to work with undirected graphs, for each fact
bond(c, u, v, t) we have added a corresponding fact bond(c, v, u, t) to $B$. In
addition, in our experiments we have also included the background predicates

— benzene, carbon_6_ring, hetero_aromatic_6_ring, ring6, 

— carbon_5_aromatic_ring, carbon_5_ring, hetero_aromatic_5_ring, ring5, 

— nitro, and methyl. 

These predicates define building blocks for complex chemical patterns (for their
definitions see the Appendix of [29]). We note that we have not used the available
numeric information (i.e., charge of atoms, log P, and $\varepsilon_{LUMO}$).
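The step of adding a reversed copy of every bond fact is simple enough to show directly. In this sketch facts are `(predicate, arguments)` pairs; the compound and atom identifiers and the numeric values are illustrative placeholders, not taken from the database of [29].

```python
def symmetrize_bonds(facts):
    """Add bond(c, v, u, t) for every bond(c, u, v, t) so bonds can be followed both ways."""
    result = set(facts)
    for pred, args in facts:
        if pred == "bond":
            c, u, v, t = args
            result.add(("bond", (c, v, u, t)))
    return result


# Illustrative facts only (identifiers and values are placeholders).
facts = {
    ("atm", ("d1", "d1_1", "c", 22, -0.117)),
    ("atm", ("d1", "d1_2", "c", 22, -0.117)),
    ("bond", ("d1", "d1_1", "d1_2", 7)),
}
undirected = symmetrize_bonds(facts)
```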

In our experiments we used a simple refinement operator allowing only adding
new literals to the body of an acyclic conjunctive query, and not allowing the
usual operators such as unification of two variables or specialization of a variable.
That is, a refinement of an acyclic conjunctive query is obtained by selecting one
of its literals, and, depending on the predicate symbol of the selected literal, by
adding a set of literals to its body as follows. If the literal is the head of the
clause we add either a single ‘atm’ literal or a set of literals corresponding to one
of the building blocks. If the literal selected is an ‘atm’ then we add either a new
atom connected by a bond fact with the selected one, or we add a building block
containing the selected atom. If a bond literal has been selected then we add a
building block containing the current bond. Such building blocks are a common
element specifiable in several declarative bias languages already in use in ILP
(see, e.g., the relational cliches of FOCL [28] or the lookahead specifications of
TILDE [5]); at present, they are simply given as part of the refinement operator.⁵

As an example, let

$Q : \mathrm{active}(x_1, x_2) \leftarrow \ldots, L, \ldots$

be an acyclic conjunctive query, where $L = \mathrm{bond}(x_1, x_i, x_j, 7)$.
Then a refinement of $Q$ with respect to $L$ and building block benzene is the
acyclic conjunctive query

$$
\begin{aligned}
Q' = Q \cup \{ & \mathrm{bond}(x_1, x_j, y_1, 7), \mathrm{bond}(x_1, y_1, y_2, 7), \mathrm{bond}(x_1, y_2, y_3, 7), \\
& \mathrm{bond}(x_1, y_3, y_4, 7), \mathrm{bond}(x_1, y_4, x_i, 7), \mathrm{atm}(x_1, y_1, c, u_1, v_1), \\
& \mathrm{atm}(x_1, y_2, c, u_2, v_2), \mathrm{atm}(x_1, y_3, c, u_3, v_3), \mathrm{atm}(x_1, y_4, c, u_4, v_4), \\
& \mathrm{benzene}(x_1, x_i, x_j, y_1, y_2, y_3, y_4) \},
\end{aligned}
$$

where the $y$'s, $u$'s, and $v$'s are all new variables. Note that despite the
fact that the new bond literals together with $L$ form a cycle of length 6, $Q'$
is acyclic, as we have also attached the benzene literal containing the six
corresponding variables. It holds in general that the refinement operator used
in our work does not violate the acyclicity property. Finally we note that only
properly subsumed refinements have been considered (i.e., if $Q'$ is a
refinement of $Q$ then $Q' \not\leq Q$).
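The claim that the benzene literal keeps the refined query acyclic can be checked mechanically with the deletion rules from Section 4. In the sketch below the elided part of $Q$ is taken to be empty, so the query consists of the head, $L$, and the added literals; variable names such as `Xi` and `Y1`, the literal encoding, and the function names are assumptions of this illustration.

```python
def gyo_acyclic(hyperedges):
    """Deletion rules of the acyclicity definition, applied until nothing changes."""
    edges = [set(e) for e in hyperedges]
    while True:
        # Delete every vertex that occurs in exactly one hyperedge.
        for e in edges:
            for v in list(e):
                if sum(v in f for f in edges) == 1:
                    e.discard(v)
        # Delete one hyperedge that is empty or contained in another hyperedge.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                break
        else:
            return not edges       # stuck: acyclic iff nothing is left


def query_hypergraph(literals):
    """One hyperedge per literal: its set of variables (upper-case strings)."""
    return [{t for t in args if isinstance(t, str) and t[:1].isupper()}
            for _, args in literals]


head = ("active", ("X1", "X2"))
selected = ("bond", ("X1", "Xi", "Xj", 7))                      # the literal L
ring = ["Xj", "Y1", "Y2", "Y3", "Y4", "Xi"]
new_bonds = [("bond", ("X1", a, b, 7)) for a, b in zip(ring, ring[1:])]
new_atoms = [("atm", ("X1", f"Y{k}", "c", f"U{k}", f"V{k}")) for k in range(1, 5)]
benzene = ("benzene", ("X1", "Xi", "Xj", "Y1", "Y2", "Y3", "Y4"))

q_prime = [head, selected] + new_bonds + new_atoms + [benzene]
with_block = gyo_acyclic(query_hypergraph(q_prime))
without_block = gyo_acyclic(query_hypergraph(q_prime[:-1]))     # drop benzene
```

With the benzene literal every bond hyperedge is contained in the benzene hyperedge and can be deleted; without it the six bond literals form a cycle on which neither deletion rule applies.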

In order to see how our restriction on the hypothesis language influences the
predictive accuracy, we have used 10-fold cross-validation with the 10 partitions
given in [29]. Setting parameters $P_{cov}$ to 0.1, $P_{acc}$ to 125/188 (default
accuracy), the size of the beam stack to 100, and $P_{change}$ to [...] (note
that this is not a depth bound), we obtained 87% accuracy. Using the ILP system
Progol [22], the authors of [29] report 88% accuracy, and a similar result, 89%,
was achieved by STILL [27] on the same ten partitions. However, in contrast to
our experiment, in the Progol and STILL experiments, the numeric information was
considered as well.

As an example, one of the rules discovered independently in each of the ten
runs is

$$
\begin{aligned}
\mathrm{active}(x_1, x_2) \leftarrow{} & \mathrm{atm}(x_1, x_3, c, 27, x_4), \\
& \mathrm{bond}(x_1, x_3, x_5, x_{25}), \mathrm{bond}(x_1, x_5, x_6, x_{26}), \mathrm{bond}(x_1, x_6, x_7, x_{27}), \\
& \mathrm{bond}(x_1, x_7, x_8, x_{28}), \mathrm{bond}(x_1, x_8, x_9, x_{29}), \mathrm{bond}(x_1, x_9, x_3, x_{30}), \\
& \mathrm{atm}(x_1, x_5, x_{10}, x_{11}, x_{12}), \mathrm{atm}(x_1, x_6, x_{13}, x_{14}, x_{15}), \mathrm{atm}(x_1, x_7, x_{16}, x_{17}, x_{18}), \\
& \mathrm{atm}(x_1, x_8, x_{19}, x_{20}, x_{21}), \mathrm{atm}(x_1, x_9, x_{22}, x_{23}, x_{24}), \\
& \mathrm{ring6}(x_1, x_3, x_5, x_6, x_7, x_8, x_9), \\
& \mathrm{bond}(x_1, x_7, x_{31}, x_{32}), \mathrm{atm}(x_1, x_{31}, c, 27, x_{33})
\end{aligned}
\qquad (2)
$$

(see also Fig. 1). Applying the notion of variable depth given in [25], the depth
of the above rule is 7 according to the depth of the deepest variable $x_{22}$.
Furthermore, its width is 15. Finally we note that using the standard Prolog
backtracking technique, just evaluating the single rule above would take on the
order of hours.

⁵ Note that the use of such building blocks facilitates the search by making wide and
deep clauses reachable in fewer steps, but of course does not change the complexity
of the membership problem. Thus, even when given these building blocks, such
clauses would be difficult to learn for other ILP learners due to the intractable cost
of matching.




Fig. 1. A graphical representation of the body of rule (2). 



7 Conclusion 

In this paper, we have taken the first steps towards discovery of deep and wide
first-order structures in ILP. Taking up the argument recently put forward by
[10], our approach centrally focuses on the matching costs caused by deep and
wide clauses. To this end, from relational database theory [1,30], we have
introduced a new class of clauses, α-acyclic conjunctive queries, which has not
previously been used in practical ILP algorithms. Using the algorithms
summarized in this paper, the matching problem for acyclic clauses can be solved
efficiently. As shown in our case study in the domain of mutagenicity, with an
appropriate greedy learner as presented in the paper, it is then possible to learn
clauses of significantly greater width and depth than previously feasible, and
the additional predictive power gained by these deep and wide structures has in
fact allowed us to reach a predictive accuracy comparable to the one attained in
previous studies, without using the additional numerical information available in
these experiments.

Based on these encouraging preliminary results, further work is necessary
to substantiate the evidence presented in this paper. Firstly, in the case study
presented here, we have used quite a simple greedy algorithm, so that further
improvements seem possible with more sophisticated search strategies (see, e.g.,
[20]). Secondly, further experiments are of course necessary to examine in which
type of problem the advantages shown here will also materialize; we expect this
to be the case in all problems involving structurally complex objects or
relationships. To facilitate the experiments, we will switch to a refinement operator
based on a declarative bias language (see [24] for an overview), as is
commonplace in ILP. Finally, it appears possible to generalize our results to an even
larger class of clauses, by considering certain classes of cyclic conjunctive queries
which are also solvable in polynomial time (see, e.g., [7]).



Abstract. We propose a preprocessing method for Web mining which,
given semi-structured documents with the same structure and style,
distinguishes useless parts and non-useless parts in each document without
any knowledge of the documents. It is based on the simple idea that any
n-gram is useless if it appears frequently. To decide an appropriate pair
of length n and frequency a, we introduce a new statistical measure, the
alternation count. It is the number of alternations between useless parts
and non-useless parts. Given news articles written in English or Japanese
with some non-articles, the algorithm eliminates frequent n-grams used
for the structure and style of articles and extracts the news contents and
headlines with more than 97% accuracy if articles are collected from the
same site. Even if input articles are collected from different sites, the
algorithm extracts contents of articles from these sites with at least 95%
accuracy. Thus, the algorithm does not depend on the language, is robust
to noise, and is applicable to multiple formats.



1 Introduction 

Data mining is a research field that develops tools for finding useful knowledge 
in databases [3,4,14]. In this field, databases are assumed to have explicit and 
static structures. Resources on the WWW, on the other hand, do not have such 
structures. Web mining is mining from such resources, and text mining is 
mining from unstructured or semi-structured documents. We consider Web or 
text mining from semi-structured documents. A semi-structured document has 
a tree structure, as HTML/XML files, BibTeX files, etc. do [1]. 

Since resources on the Web are widely distributed and heterogeneous, it is 
important to collect documents and clean them. When we collect a large number 
of documents, we use hyperlinks for efficiency. For example, search engines 
provide hyperlinks. Since a hyperlink is not created by the collector, we cannot 
assume that the collected documents are well cleaned. Some of them are written 
in different languages or are far from the desired topic. Thus, Web mining 
algorithms and preprocessors should be robust to noise and should be applicable 
to any natural and markup languages. 

Usual mining algorithms assume that collected documents are written in the 
same language and preprocess them with some knowledge, such as the grammar 
of HTML to remove tags, stop word lists [10], stemming techniques [15], 
morphological analysis, etc. They depend on natural and markup languages. Some 
algorithms require additional input documents as background knowledge. In 
addition to an input set of documents, the algorithm in [5] requires another set of 
documents in order to remove substrings highly common to both sets. 

In this paper, we present an algorithm that cleans collected documents without 
any knowledge of them. From input documents, this algorithm finds a set 
of frequent n-grams. Using this set, we can eliminate tags or directives, and 
stereotyped expressions in the documents because they are common to the col- 
lected documents and so useless. Eliminating or finding useless parts contrasts 
with usual mining algorithms which find useful knowledge such as association 
rules [3], association patterns [4], word association patterns [5], and episode 
rules [14]. 

An input for our algorithm is a set of semi-structured documents which 
share the same structure and style, such as static Web pages in the same site 
or dynamic pages generated by a search facility. Since the number of Web sites 
providing search facilities is increasing [7], the number of Web pages applicable 
to our algorithm is large. In such pages, there exist frequent substrings, such as 
the name of the site, navigation and advertisement links, etc. Moreover, there 
exist common tags or directives which structure texts, because the pages have 
the same structure and style. When we read such pages, we usually focus on the 
variable parts and ignore the invariable parts. In this sense, substrings that 
describe the same structure and style in such pages are useless. 

We treat a document as just a string and define any frequent n-gram to be 
useless. An n-gram is just a string with length n, so our algorithm does 
not depend on natural and markup languages. Once an appropriate pair (n, a) 
of a length n and a frequency a, which is called a cut point, is decided, we divide 
each of the documents into two parts using the pair as follows: if the frequency 
of an n-gram is in the top a percent of the frequencies of all n-grams, then the 
n-gram is useless. 

To decide an appropriate pair (n, a), we introduce a new statistical measure, 
the alternation count. It is the number of changes between useless parts and 
non-useless parts in a document. For a set D of documents, the alternation count 
of D is the sum of the alternation counts of its documents. An alternation count 
shows how many times useless (or non-useless) parts appear in documents. A 
large alternation count means that a document is split into too many small 
pieces, in which case the structures of the document are destroyed. Therefore, 
our algorithm searches cut points from (2, 1) while the alternation count of the 
current cut point is decreasing. The algorithm stops if the alternation count 
becomes greater than the current one when it increases n or a by one. 

We define an optimal cut point (n, a) as one which attains a locally minimum 
alternation count and develop an algorithm that finds an optimal cut point. It runs 
in $O(n^2 N + nN \log N)$ time, where N is the total length of the input and n is the 
length of an optimal cut point. Experimentally, n is less than 30 and $n \ll N$, so 
the time complexity is approximately $O(N \log N)$. 

We present experimental results using news articles as input for the 
algorithm. The articles are written in English or Japanese. For articles collected 
from the same site, the algorithm detects the tag regions of HTML files and 
highly common expressions in the articles as useless. It extracts the news 
contents as non-useless parts with more than 97% accuracy even if non-articles 
are contained as noise. Therefore, the algorithm does not depend on the 
language and is robust to noise. 

To evaluate experimental results, we manually define the non-useless parts of 
the articles for each data set. Comparing the manually defined division with the 
division by the algorithm, we define the accuracy to be the ratio of the number 
of letters categorized into the same part (useless or non-useless) to the total 
length of the input. 

We also evaluate the accuracy for documents collected from different sites 
with any combination of English and Japanese sites. The algorithm extracts 
the contents of articles from these sites with at least 95% accuracy. Since the 
algorithm simply counts n-gram frequencies, an extremely small set of articles 
might be expected to be ignored when combined with large data sets. However, 
some experiments show that the algorithm detects the contents of articles in 
such a small set as well as those in a large set. 

This paper is organized as follows. In the next section, basic notations are 
given and then the key notion, the alternation count, is introduced. In Section 3, 
we present an algorithm that divides each of the given documents into useless 
and non-useless parts. The time complexity of the algorithm is presented in 
Section 3.1. Experimental results are shown in Section 4. 

2 Alternation Count and Optimal Cut Point 

2.1 Preliminaries 

The set $\Sigma$ is a finite alphabet. Let $x = a_1 \cdots a_n$ ($a_i \in \Sigma$ for each i) be a string 
over $\Sigma$. We denote the length of x by $|x|$. An n-gram is a string whose length 
is n. For an integer $1 \le i \le |x|$, we denote by $x[i]$ the ith letter of x. Let x and y 
be two strings. The concatenation of x and y is denoted by $x \cdot y$ or simply by xy. 
We write x = y if $|x| = |y|$ and $x[i] = y[i]$ for each $1 \le i \le |x|$. 

For a string x, if there exist strings $u, v, w \in \Sigma^*$ such that x = uvw, we say 
that v is a substring of x. An occurrence of v in x is a positive integer i such 
that $x[i] \cdots x[i + |v| - 1] = v$. Using the occurrence and the length of v, v is also 
denoted by $x[i..i + |v| - 1]$. 

If $\Sigma = \{0, 1\}$, then a string over $\Sigma$ is called a binary string. For $i \in \Sigma$ and 
$x \in \Sigma^*$, $[x]_i$ denotes the number of i's in x. For two binary strings x and y with 
the same length, the bitwise "and" and "exclusive-or" operations are defined as follows. 
$x \& y$ is a binary string with length $|x|$ such that $(x \& y)[i] = 1$ if $x[i] = y[i] = 1$ 
and $(x \& y)[i] = 0$ otherwise. $x \oplus y$ is also a binary string with length $|x|$ such that 
$(x \oplus y)[i] = 1$ if $x[i] \ne y[i]$ and $(x \oplus y)[i] = 0$ otherwise. For example, if x = 01101 and 
y = 11100, then $x \& y = 01100$, $x \oplus y = 10001$, $[x \& y]_0 = 3$, and $[x \oplus y]_1 = 2$. 
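
The example above can be checked with a minimal Python sketch. Binary strings are represented as Python strings over "0" and "1"; the function names bit_and, bit_xor, and count are hypothetical names chosen for illustration and do not appear in the paper.

```python
def bit_and(x: str, y: str) -> str:
    """Bitwise "and" of two binary strings of the same length."""
    if len(x) != len(y):
        raise ValueError("binary strings must have the same length")
    return "".join("1" if p == q == "1" else "0" for p, q in zip(x, y))


def bit_xor(x: str, y: str) -> str:
    """Bitwise "exclusive-or" of two binary strings of the same length."""
    if len(x) != len(y):
        raise ValueError("binary strings must have the same length")
    return "".join("1" if p != q else "0" for p, q in zip(x, y))


def count(x: str, symbol: str) -> int:
    """The number of occurrences of symbol in x, written [x]_symbol in the text."""
    return x.count(symbol)


x, y = "01101", "11100"
print(bit_and(x, y), bit_xor(x, y))
print(count(bit_and(x, y), "0"), count(bit_xor(x, y), "1"))
```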

Let x be a string (not necessarily binary) and $W = \{w_1, \ldots, w_n\}$ be a set 
of substrings of x. A range string of W on x, denoted by $r_x(W)$, is a binary 
string with length $|x|$ such that $r_x(W)[j] = 0$ if $i \le j \le i + |w_k| - 1$ for some 
occurrence i of $w_k$ ($1 \le k \le n$) and $r_x(W)[j] = 1$ otherwise. Successive 0's 
on the range string show the intervals on x covered by the substrings in W. For 
example, let x = accbaacbc and W = {cb, ba}. Then $r_x(W) = 110001001$. 
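
The construction of a range string can be sketched in a few lines of Python. This is only an illustration under our own naming (range_string is a hypothetical function); it marks every occurrence of every substring in W, including overlapping occurrences, and uses 0-based indices internally where the text uses 1-based positions.

```python
from typing import Iterable


def range_string(x: str, W: Iterable[str]) -> str:
    """Return the range string of W on x: 0 where x is covered, 1 elsewhere."""
    r = ["1"] * len(x)
    for w in W:
        if not w:
            continue  # the empty string covers nothing
        start = x.find(w)
        while start != -1:
            for j in range(start, start + len(w)):
                r[j] = "0"
            # Search again from the next position so that overlapping
            # occurrences are also found.
            start = x.find(w, start + 1)
    return "".join(r)


print(range_string("accbaacbc", {"cb", "ba"}))
```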

2.2 Alternation Count 

In this section, we introduce the key notion, the alternation count. We consider 
semi-structured documents with the same structure and style, such as static pages 
of one site and dynamic pages created by a search facility. In such documents, 
there exist frequent substrings for the structure and style, and many users are 
not interested in them. Therefore, we define useless parts of documents as follows. 

Definition 1. Let D be a set of strings. Then, a substring of a string in D is 
said to be useless when it appears frequently in D. 

We treat a semi-structured document as just a string. Note that we do not define 
"importance", "significance", or "usefulness" as other research on text and 
Web mining does. 

Next, we consider how many times a substring must appear before we can say 
that it appears "frequently". The measure for this is a new notion, the alternation 
count. Given a string and a set of substrings of the string, it is the number of 
changes from a part of the string covered by the given substrings to an uncovered 
part, and vice versa. 

Definition 2. Let x be a string and W be a set of substrings of x. Then, the 
alternation count of W on x, which is denoted by $A_x(W)$, is the number of 
boundaries between different values (0 and 1) on the range string $r_x(W)$. 

Example 1. Let x = accbaacbc and W = {cb, ba}. Then $A_x(W) = 4$ because 
the letters of x covered by cb or ba are x[3..5] = cba and x[7..8] = cb, and so 
$r_x(W) = 110001001$. 

The above definition is easily extended to a set of strings instead of a single 
string x. The alternation count of W on a set D of strings is the sum of the 
alternation counts of W on each $x \in D$ and is denoted by $A_D(W)$. 
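
Definition 2 and its extension to a set of strings can be illustrated by the following Python sketch. The function names are hypothetical, and the second string in the usage example is our own toy input, not data from the paper.

```python
from typing import Iterable


def range_string(x: str, W: Iterable[str]) -> str:
    """Range string of W on x: 0 where x is covered by some w in W, 1 elsewhere."""
    r = ["1"] * len(x)
    for w in W:
        if not w:
            continue
        start = x.find(w)
        while start != -1:
            for j in range(start, start + len(w)):
                r[j] = "0"
            start = x.find(w, start + 1)
    return "".join(r)


def alternation_count(x: str, W: Iterable[str]) -> int:
    """Number of boundaries between different values on the range string."""
    r = range_string(x, W)
    return sum(1 for p, q in zip(r, r[1:]) if p != q)


def alternation_count_set(D: Iterable[str], W: Iterable[str]) -> int:
    """Sum of the alternation counts of W on each string in D."""
    W = list(W)  # allow W to be iterated once per string
    return sum(alternation_count(x, W) for x in D)


W = {"cb", "ba"}
print(alternation_count("accbaacbc", W))               # Example 1
print(alternation_count_set(["accbaacbc", "acba"], W))
```

In the second call, the toy string acba has the range string 1000 and therefore contributes one boundary to the sum.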

2.3 Optimal Cut Point 

Our algorithm is required to receive a set of semi-structured documents and 
divide them into two parts of substrings — useless parts and non-useless parts. 

Since we treat a semi-structured document as just a string, an input for 
our algorithm is a set $D = \{x_1, x_2, \ldots, x_n\}$ of strings. To express useless or 
non-useless parts, we use a set W of substrings of $x_i \in D$. Thus, our algorithm is 
required to receive D and decide W such that the consecutive 0's on $r_{x_i}(W)$ 
($i = 1, 2, \ldots, n$) cover the substrings of $x_i$ for the structure and style. 




Eliminating Useless Parts in Semi-structured Documents 






[Figure 1: two copies of the same string. Upper copy: useless (gray) and 
non-useless (white) parts, where gray marks the substrings for the structure and 
style. Lower copy: frequent substrings (black) appear in both parts.] 

Fig. 1. The two strings are the same. The upper one shows where the substrings for 
the structure and style are. The lower one shows a case where the frequent substrings 
(black parts) fail to cover the gray parts 



The simplest method to find frequent substrings of D is to enumerate 
all substrings of D and decide a boundary between frequent substrings and 
non-frequent ones according to some measure. However, W constructed by this 
method may contain short substrings, and they appear in non-useless parts as 
well as useless parts. Moreover, the method requires a large time complexity 
because there exist $O(N^2)$ substrings for a string with length N. 

Instead, we let the algorithm decide the appropriate length 
of substrings as well as the appropriate frequency. In other words, we use 
n-grams instead of substrings of arbitrary length. In our algorithm, the frequency 
is expressed by a percentage. A pair (n, a) of a length and a frequency decides W 
such that W is the set of the top a percent frequent n-grams in D. In the sequel, 
we denote $A_D(W)$ by $A_D(n, a)$. The pair (n, a) is called a cut point of D. An 
n-gram is said to be frequent on a cut point (n, a) if the n-gram is in the W 
decided by (n, a). 
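
How a cut point (n, a) decides W can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, "top a percent" is read as the first a percent of the distinct n-grams ranked by decreasing number of occurrences, and rounding that number up to an integer is our assumption, since the text does not state a rounding rule.

```python
from collections import Counter
from typing import Iterable, List, Set


def count_ngrams(D: Iterable[str], n: int) -> Counter:
    """Count the occurrences of every n-gram in every string of D."""
    counts: Counter = Counter()
    for x in D:
        for i in range(len(x) - n + 1):
            counts[x[i:i + n]] += 1
    return counts


def frequent_ngrams(D: Iterable[str], n: int, a: int) -> Set[str]:
    """The set W decided by the cut point (n, a)."""
    counts = count_ngrams(D, n)
    ranked: List[str] = [g for g, _ in counts.most_common()]
    # Assumption: round the size of W up to an integer.
    k = (len(ranked) * a + 99) // 100
    return set(ranked[:k])


D = ["abababab", "ababcd"]  # toy input for illustration only
print(frequent_ngrams(D, 2, 25))
print(frequent_ngrams(D, 2, 50))
```

In the toy input, the 2-grams ab, ba, bc, and cd occur 6, 4, 1, and 1 times, so the top 25 percent is {ab} and the top 50 percent is {ab, ba}.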

Since an appropriate cut point depends on the input documents, we let the 
algorithm find it automatically. First, we consider what an appropriate cut 
point is. The two strings in Fig. 1 are the same string. In the upper string, the 
substrings for the structure and style are colored gray. An appropriate (n, a) 
covers the gray parts with frequent n-grams. The lower string shows a case where 
frequent n-grams appear in both gray and white parts. In this case, the alternation 
count is larger than the ideal alternation count and the substrings for the 
structure and style are destroyed. 

Short substrings do not seem to form structures and styles, and they 
appear everywhere. Therefore, the alternation count may be large if n is too 
small. The alternation count may also be large if a is smaller than the ideal one, 
since increasing a connects separate frequent n-grams. If n or a is greater than 
the corresponding ideal one, the alternation count also becomes large. Thus, a 
cut point (n, a) is appropriate if both n and a are large enough and the pair attains 
a locally minimum alternation count. 

A path is a sequence of cut points $(n_1, a_1), (n_2, a_2), \ldots, (n_k, a_k)$ such that (1) 
either $n_{i+1} = n_i + 1$ or $a_{i+1} = a_i + 1$ for each $i = 1, 2, \ldots, k - 1$, (2) $A_D(n_i, a_i) > 
A_D(n_{i+1}, a_{i+1})$ for each $i = 1, 2, \ldots, k - 1$, and (3) $A_D(n_k, a_k) < A_D(n_k + 1, a_k)$ 
and $A_D(n_k, a_k) < A_D(n_k, a_k + 1)$. A cut point (n, a) is optimal if there exists a 
path from (2, 1) to (n, a). Note that there may exist several optimal cut points. 

The trivial initial cut point is (1, 1). However, a 1-gram is too short to describe 
the structure and style of documents. Thus, we define the initial cut point to be 
(2, 1). 

Using the above notions, we define the problem of eliminating useless parts as 
follows: given a set D of strings, find an optimal cut point of D. 



3 Algorithm 

In this section, we describe FindOptimal, which finds an optimal cut point (n, a) 
(see Fig. 3). From the initial cut point (2, 1), the algorithm compares the alternation 
counts on the next two cut points with the current one. It stops when the alternation 
counts on both of the next two cut points are greater than the current one. 

The algorithm does not remove any tags or directives of semi-structured 
documents. It only modifies documents according to the following conventional 
preprocessing rules: tabs and newlines are treated as a space, and consecutive 
spaces are compressed into one space. An input for the algorithm is a set of 
strings preprocessed according to the above rules. 
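
The two preprocessing rules can be sketched in Python as follows. The function name preprocess is hypothetical; only tabs, newlines, and spaces are handled, as in the rules stated above.

```python
import re


def preprocess(document: str) -> str:
    """Treat tabs and newlines as spaces and compress runs of spaces into one."""
    return re.sub(r"[ \t\n]+", " ", document)


print(repr(preprocess("<p>\n\tHello   world</p>\n")))
```

Tags are left untouched, in line with the statement that the algorithm does not remove any tags or directives.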

FindOptimal uses two subroutines alternation (see Fig. 2) and countsort. 
The subroutine countsort receives an integer n and a set D of strings, then 
counts all n-grams in D and sorts them by the number of their occurrences. 
The subroutine keeps the n-grams and the numbers of their occurrences in a 
hash table. It returns an array of the counted n-grams sorted in decreasing 
order of the number of occurrences. 

The subroutine alternation receives a set D of strings, an array O of strings, 
a length n, and a percentage a. The variable W used in alternation keeps the 
first a percent of the strings in the sorted array O, which is an output of countsort. 
Using W, the subroutine constructs a range string r and then counts the boundaries, 
which gives the alternation count on (n, a). 

The main algorithm FindOptimal (see Fig. 3) receives a set D of strings. It 
computes the alternation count on the current cut point and also the alternation 
counts on the next two candidates, (n, a + 1) and (n + 1, a). After comparing the 
alternation counts on these three cut points, it decides the next cut point. If both 
$A_D(n, a + 1)$ and $A_D(n + 1, a)$ are smaller than $A_D(n, a)$, the algorithm selects 
the cut point providing the smaller alternation count. If there is no next cut point 
providing a smaller alternation count than the current one, then it returns the 
current cut point as an optimal cut point. 
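
The search described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation, and several details are our assumptions: W is taken as the first a percent of the distinct n-grams, rounded up; the alternation count of a set is computed as the sum over the individual strings, following the definition in Section 2.2, rather than on one concatenated string; the search stops when neither candidate is strictly smaller than the current count; and a tie between the two candidates is resolved in favor of increasing a.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def countsort(D: Sequence[str], n: int) -> List[str]:
    """Distinct n-grams of D, sorted by decreasing number of occurrences."""
    counts = Counter(x[i:i + n] for x in D for i in range(len(x) - n + 1))
    return [g for g, _ in counts.most_common()]


def alternation(D: Sequence[str], O: List[str], n: int, a: int) -> int:
    """Alternation count of D on the cut point (n, a)."""
    W = set(O[:(len(O) * a + 99) // 100])  # assumption: round up
    total = 0
    for x in D:
        r = [1] * len(x)
        for i in range(len(x) - n + 1):
            if x[i:i + n] in W:
                r[i:i + n] = [0] * n  # mark the covered positions
        total += sum(1 for p, q in zip(r, r[1:]) if p != q)
    return total


def find_optimal(D: Sequence[str]) -> Tuple[int, int]:
    """Walk from (2, 1) while the alternation count strictly decreases."""
    n, a = 2, 1
    sorted_ngrams: Dict[int, List[str]] = {n: countsort(D, n)}
    val0 = alternation(D, sorted_ngrams[n], n, a)
    max_len = max(len(x) for x in D)
    while n < max_len and a < 100:
        if n + 1 not in sorted_ngrams:  # run countsort once per length
            sorted_ngrams[n + 1] = countsort(D, n + 1)
        val1 = alternation(D, sorted_ngrams[n], n, a + 1)
        val2 = alternation(D, sorted_ngrams[n + 1], n + 1, a)
        if val0 <= val1 and val0 <= val2:
            break  # no neighbor is strictly smaller
        if val1 <= val2:  # assumption: ties go to a + 1
            a, val0 = a + 1, val1
        else:
            n, val0 = n + 1, val2
    return n, a


docs = [  # toy pages sharing one template, for illustration only
    "<html><body><p>alpha one</p></body></html>",
    "<html><body><p>beta two!</p></body></html>",
    "<html><body><p>gamma 33</p></body></html>",
]
print(find_optimal(docs))
```

Because every accepted step strictly decreases a non-negative integer, the loop always terminates. On inputs as small as the toy pages above the returned cut point is not meaningful; the method is intended for collections with many distinct n-grams.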

function alternation(var D: set of strings; 
         O: array of strings; n: integer; a: integer): integer; 
var 
  i: integer; 
  s, x: string; 
  W: hash table; 
  r: array [1..|x|] of integers; 
begin 
  x := string concatenated all strings in D; 
  for i := 1 to |x| do r[i] := 1; 
  W := the first a percent of the substrings in O; 
  for i := 1 to |x| - n + 1 do begin 
    s := x[i..i + n - 1]; 
    if s ∈ W then 
      r[i..i + n - 1] := 0; 
  end {for} 
  count boundaries between r[i] = 0 and r[i] = 1; 
  return the boundaries; 
end; 

Fig. 2. The subroutine returns the alternation count of D on (n, a) 

3.1 Complexity 

In this section, we discuss the time complexity required by FindOptimal. Let D 
be an input set of strings and N be the total length of D. 

First, we estimate the complexity required by the subroutine countsort. It 
takes $O(1)$ time to add a new n-gram to a hash table or to check whether a 
given n-gram is already stored in the hash table. Since the subroutine counts all 
substrings with the same length, there exist at most $O(N)$ n-grams. Therefore, 
countsort needs $O(N)$ time to construct the hash table and $O(N \log N)$ time 
to sort the n-grams. Thus, countsort runs in $O(N \log N)$ time. 

Next, we consider the subroutine alternation. Let (D, O, n, a) be an input 
for the subroutine. $O(N)$ time is required to construct the hash table W. In the 
last for loop, checking whether $s \in W$ requires $O(1)$ time using the hash table W, 
and writing 0 on r[i..i + n - 1] requires $O(n)$ time. Therefore, this loop is 
completed in $O(nN)$ time. Counting boundaries is done in $O(N)$ time by scanning 
r from left to right. Thus, the subroutine runs in $O(nN)$ time. 

Finally, we estimate the time complexity of FindOptimal. Let $(n_f, a_f)$ be 
the final output of FindOptimal. FindOptimal calls countsort $n_f - 1$ times 
and alternation at most $3(n_f + a_f - 2) + 1$ times because it passes through 
$n_f + a_f - 2$ cut points from the initial cut point (2, 1) to $(n_f, a_f)$. Thus, the 
routine runs in the following time: 

$O(n_f N \log N + (n_f + a_f) n_f N)$ 

$= O(n_f N \log N + n_f^2 N + a_f n_f N)$ 

$= O(n_f^2 N + n_f N \log N),$ 

procedure FindOptimal(var D: set of strings); 
{FindOptimal finds an optimal cut point.} 
var n, a: integer; 
  {n and a keep the current length and percentage, respectively.} 
var val0, val1, val2: integer; 
var Ocur, Onext: array of strings; 
begin 
  n := 2; a := 1; {initialize n and a.} 
  Ocur := countsort(D, n); 
  val0 := alternation(D, Ocur, n, a); 
  {val0 keeps the alternation count on the current (n, a).} 
  while (n < max{|d| | d ∈ D}) and (a < 100) do begin 
    if countsort(D, n + 1) is not done then 
      Onext := countsort(D, n + 1); 
    val1 := alternation(D, Ocur, n, a + 1); 
    val2 := alternation(D, Onext, n + 1, a); 
    {val1 and val2 keep the alternation counts on next candidates.} 
    if (val0 < val1) and (val0 < val2) then 
      goto OUTPUT; 
    else if (val0 > val1) and (val1 < val2) then begin 
      a := a + 1; 
      val0 := val1; 
    end 
    else if ((val0 > val2) and (val2 < val1)) then begin 
      n := n + 1; 
      val0 := val2; 
      Ocur := Onext; 
    end 
  end; {while} 
OUTPUT: 
  report (n, a); 
end; 

Fig. 3. The main algorithm FindOptimal outputs an optimal cut point using two 
subroutines countsort and alternation 

where $a_f$ is a non-negative constant less than 100. Our experimental results show 
that $n_f$ is less than 30 and $n_f \ll N$ (see Section 4). Thus, the time complexity is 
approximately $O(N \log N)$. 



4 Experiments 

We use news articles as input for FindOptimal. Each article is written in 
English or Japanese and is provided as an HTML file which contains its headline 
and body. 

We have two types of experiments depending on the number of sites from 
which we collect articles: articles from the same site (see Section 4.2) and articles 
from different sites (see Section 4.3). 



4.1 Evaluation 

To evaluate an output cut point, we use two approaches. The first is to modify 
the HTML files so that every frequent n-gram on the cut point is colored gray, as 
are the third to fifth and the seventh to eighth letters of accbaacbc. A colored 
letter corresponds to a 0 on the range string 110001001. 

The other approach is to calculate accuracy, recall, and precision using a 
binary string called a correct string. We can conclude that FindOptimal outputs 
an appropriate cut point if the range string $r_x(n, a)$ is similar to the corresponding 
correct string. 

Let D be an input for FindOptimal and x be the string obtained by concatenating 
all strings in D. A correct string c is a binary string with length $|x|$ such that, for 
$1 \le i \le |c|$, $c[i] \in \{0, 1\}$ is decided according to manually specified pairs of 
delimiters. Let (l, r) be a pair of delimiters. If lur is a substring of x for some 
string u, then the positions of c corresponding to u are filled with 1; all other 
positions are filled with 0. In our experiments, a substring surrounded by such a 
pair is the headline or the body of an article. 

We define the body and the headline of an article to be non-useless, manually 
extract left and right delimiters from each data set, and construct correct 
strings using the pairs of delimiters. Then, using two binary strings, the correct 
string c and the range string r, we define accuracy as $[c \oplus r]_0/|x|$, recall as 
$[c \& r]_1/[c]_1$, and precision as $[c \& r]_1/[r]_1$, where $\oplus$ is the bitwise exclusive-or. 
The accuracy is the ratio of the number of positions i such that c[i] = r[i] to the 
total length of the input documents. 
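
The three measures can be computed directly from the two binary strings, as in the following Python sketch. The function name evaluate and the two strings in the usage example are hypothetical; returning 0.0 when a denominator is zero is our own convention, which the text does not specify.

```python
from typing import Dict


def evaluate(c: str, r: str) -> Dict[str, float]:
    """Accuracy, recall, and precision of a range string r against a correct string c.

    In both strings, 1 marks a non-useless position and 0 a useless one.
    """
    if len(c) != len(r) or not c:
        raise ValueError("c and r must be non-empty and of equal length")
    agree = sum(1 for p, q in zip(c, r) if p == q)        # [c xor r]_0
    both = sum(1 for p, q in zip(c, r) if p == q == "1")  # [c & r]_1
    ones_c, ones_r = c.count("1"), r.count("1")
    return {
        "accuracy": agree / len(c),
        "recall": both / ones_c if ones_c else 0.0,
        "precision": both / ones_r if ones_r else 0.0,
    }


print(evaluate("0011110000", "0111100001"))
```

In this toy example the two strings agree on 7 of 10 positions, and 3 positions are 1 in both, out of 4 ones in c and 5 ones in r.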



4.2 Articles from the Same Site 

In this section, two sets of documents are considered. One is a set of 76 articles 
obtained from “The Washington Post (http://www.washingtonpost.com/)”. 
This set is denoted by WPOST. All pages linked from the URL are collected 
and then non-article pages are removed manually. Articles in all categories are 
included, as in every other data set in this paper. The total size of WPOST 

<HTML><HEAD> <style type="text/css">□□□1370□□□ <META NAME= 
"edition" CONTENT="M2"> <META NAME="document_name" 
CONTENT="A31243-2001Jan22"> <META NAME="source" CONTENT="Post"> 
<META NAME="section" CONTENT="DM"> <META NAME="page" 
CONTENT="E01"> <META NAME="column" CONTENT=""> <META 
NAME="slug" CONTENT="BANK23"> <META NAME="timestamp" 
CONTENT="07:0S AM"> <META NAME="category" CONTENT="BIZ"> <META 
NAME="wordcount" CONTENT="0"> <META NAME="sourceNumber" 
CONTENT="6"> <!--plsfield:title--> <TITLE>McColl Shuts Books on 
an Era (washingtonpost.com)</TITLE> </HEAD>□□□15200□□□ 
<!--plsfield:headline--> <FONT FACE="Arial, Helvetica" SIZE="+1"><B>McColl 
Shuts Books on an Era</B></FONT> <!--plsfield:stop-->□□□2050□□□ 
<A HREF="/cgi-bin/gx.cgi/AppLogic+FTContentServer? 
pagename=wpni/email&articleid=A31243-2001Jan22&node=business"> 
<B>E-Mail This Article</B></A><BR></FONT></TD> □□□470□□□<A 
HREF="/ac2/wp-dyn/A31243-2001Jan22?language=printer"><B> 
Printer-Friendly Version</B></A><BR></FONT></TD> <TD 
WIDTH="8" HEIGHT="1"><SPACER TYPE="block" WIDTH="8" 
HEIGHT="1"></TD></TR> <TR><TD COLSPAN="4" 
WIDTH="226" HEIGHT="8"><SPACER TYPE="block" WIDTH="226" 
HEIGHT="8"></TD></TR></TABLE> </TD></TR> <TR><TD 
WIDTH="226" HEIGHT="1"><SPACER TYPE="block" WIDTH="226" 
HEIGHT="1"></TD></TR></TABLE> </TD></TR></TABLE> 
<FONT SIZE="2"> <!--plsfield:byline--> <I>By Kathleen Day</I><BR> 
<!--plsfield:credit--> Washington Post Staff Writer<BR> <!--plsfield:disp_date--> 
Tuesday, January 23, 2001; Page E01 <BR> </FONT> </P> <!--plsfield:description--> 
<P><P></P> <P><P>The expected resignation tomorrow of Hugh McColl 
as head of Bank of America symbolically marks the end of an era in banking, 
where for two decades mergers have been an engine of growth and the idea 
that bigger is better has been gospel. </P> <P><P> His departure from the 
nation's largest consumer bank comes less than a year after his fierce crosstown 
competitor in Charlotte, Edward C.□□□4580□□□ "They have a mutual respect 
for each other," said Virginia Stone Mackin, spokeswoman for First Union, who 
until a few years ago worked at Bank of America. </P> <P><P> McColl 
was among the first to call Crutchfield when he was diagnosed with cancer, 
she said. And the two have been known to spent the evening talking after 
bumping into one another at parties or the country club.</P> <!--plsfield:end--> 
<P><CENTER> © 2001 The Washington Post Company </CENTER></P> 
<P><CENTER> <A HREF="/wp-dyn/business/A32068-2001Jan22.html"><img 
src="http://a188.g.akamaitech.net/f/188/920/1d/www.washingtonpost.com/wp- 
srv/images/arrow_previous.gif" width="14" height="12" border="0"> 
□□□8700□□□ <A HREF= 
"http://www.washingtonpost.com/wp-srv/maps/mit_foto.map"><IMG SRC= 
"http://a188.g.akamaitech.net/f/188/920/1d/www.washingtonpost.com/wp- 
srv/images/channelnav_news.gif" WIDTH="760" HEIGHT="16" 
BORDER="0" ALT="channel navigation" ISMAP="true"></A><BR></TD> 
</TR><TR> <TD></TD> <TD><IMG SRC= 
"http://a188.g.akamaitech.net/f/188/920/1d/www.washingtonpost.com/wp- 
srv/globalnav/images/spacer.gif" WIDTH="428" HEIGHT="1" BORDER="0" 
ALT=""></TD> <TD></TD> <TD></TD> </TR></FORM></TABLE> 
<FONT SIZE="-2"><BR></FONT> </TD></TR></TABLE> 
</BODY></HTML> 

Fig. 4. An HTML file of WPOST where frequent n-grams on (24, 10) are colored with 
gray 

[Figure 5: HTML source of one YOMIURI article. The page title begins with 
"Yomiuri On-Line"; the headline is set in bold on the second line; the comments 
<!-- photo start -->, <!-- NO PHOTO -->, and <!-- photo end --> follow; the 
Japanese body text lies in <p> elements between <!-- honbun start --> and 
<!-- honbun end -->.] 

Fig. 5. An HTML file of YOMIURI where frequent n-grams on (27, 10) are colored 
with gray 



is about 3.2M Bytes (average 42K Bytes), the minimum size of the articles is 
31K Bytes, and the maximum size is 65K Bytes. 

Given WPOST, FindOptimal outputs (24, 10) as an optimal cut point. Fig. 4 
is an HTML file in WPOST where frequent n-grams on (24, 10) are colored with 
gray. Some letters are omitted because the file is too large. A number surrounded 
by three boxes □ on each side, such as □□□5250□□□, denotes the approximate 
number of omitted letters. The color of omitted letters is the same as the color of 
the boxes and the number. In Fig. 4, the longest black substring is exactly the 
body of the article except for the last period. Short black strings are substrings 
of the headline, which appears twice, or a substring of the name of the file in 
which this article is contained. A file name is unique in this set, so a part of the 
file name remains uncolored. 

The accuracy, recall, and precision on WPOST are 0.975, 0.939, and 0.872, 
respectively. The precision is relatively low because unique substrings, such as 
a part of the file name, are included in the header of an HTML file. 

The other set consists of articles written in Japanese. They are collected 
from "Yomiuri On-Line (http://www.yomiuri.co.jp/)". This set is constructed by 
collecting all linked pages from all categories. For example, economic articles are 
collected from http://www.yomiuri.co.jp/02/index.htm. This set is denoted 
by YOMIURI. The number of articles in YOMIURI is 198, the total size of 
YOMIURI is about 2.7M Bytes (average 13.4K Bytes), the minimum size of the 
articles is 297 Bytes, and the maximum size is 39K Bytes. 

YOMIURI includes non-article files as noise data. The shortest file is not an 
article file. Moreover, files whose sizes are smaller than 12K Bytes are top pages 
of categories. Such files have only hyperlinks to articles. The number of noise 
files in YOMIURI is 14 and their total size is about 36K Bytes. 

Given YOMIURI, FindOptimal outputs (8, 10) as an optimal cut point. The 
accuracy, recall, and precision on YOMIURI are 0.992, 0.808, and 0.991, 
respectively. 

Fig. 5 is a modified HTML file. Black parts are parts of the headline and the 
body of the article. The headline is on the second line and the content starts after 
"<!-- honbun start -->", which marks the start of the body. Short colored 
substrings in the body are "<p>" tags with the beginnings or ends of sentences. 
These substrings decrease the recall. Sentence endings recur frequently in 
Japanese, and the beginnings are frequent expressions in articles such as 
"prime minister (Koizumi)". 

4.3 Articles from Different Sites 

We use “Los Angeles Times (http://www.latimes.com/)” and “The Sankei 
Shimbun (http://www.sankei.co.jp/)” in addition to sites described in Sec- 
tion 4.2. Two sets of articles collected from the URLs are denoted by LATIMES 
and SANKEI, respectively. Articles in LATIMES and SANKEI are written in 
English and Japanese, respectively. 

In this section, the following data sets are considered. Any combination of 
English and Japanese sites is included. 

WP-LA: articles from WPOST (30 articles, 2.36M Bytes) and LATIMES (40 
articles, 2.44M Bytes) 

SA-YO: articles from SANKEI (23 articles, 86K Bytes) and YOMIURI (198 
articles, 2.7M Bytes) 

SA-LA: articles from SANKEI (23 articles, 86K Bytes) and LATIMES (150 
articles, 4.5M Bytes) 

Note that the data set SANKEI is very small. 

Table 1 shows the outputs of FindOptimal and the evaluation values. A cell x (y, z) 
of an evaluation value shows the total value x on all articles and the two evaluation 
values y and z on the articles of each site. The total value is not the average of the 
corresponding two single-site values. It is calculated by just counting the letters of 
all articles from both sites. That is, if the two evaluation values of the single sites 
are $x_1/y_1$ and $x_2/y_2$, then the total evaluation is $(x_1 + x_2)/(y_1 + y_2)$. Therefore, in data 
sets SA-YO and SA-LA, the evaluation values are dominated by those of the 
large data set, YOMIURI or LATIMES, respectively. 
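
The difference between the pooled total and the average of the single-site values can be seen in a small Python sketch. The letter counts below are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
from typing import List, Tuple


def pooled(parts: List[Tuple[int, int]]) -> float:
    """Total evaluation (x1 + x2) / (y1 + y2) from per-site pairs (x_i, y_i)."""
    return sum(x for x, _ in parts) / sum(y for _, y in parts)


# Hypothetical counts: a small site (892 of 1000 letters correct)
# and a large site (9970 of 10000 letters correct).
sites = [(892, 1000), (9970, 10000)]
average = sum(x / y for x, y in sites) / len(sites)
print(pooled(sites), average)
```

With these counts the pooled value lies much closer to the value of the large site than the plain average does, which is the domination effect described above.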



Table 1. Evaluation values for different sites data 

Data Set  Optimal   Accuracy              Recall                Precision 
WP-LA     (28, 17)  0.965 (0.969, 0.962)  0.802 (0.906, 0.679)  0.894 (0.859, 0.954) 
SA-YO     (19, 13)  0.994 (0.892, 0.997)  0.889 (0.697, 0.930)  0.987 (0.965, 0.990) 
SA-LA     (23, 6)   0.959 (0.914, 0.961)  0.958 (0.992, 0.954)  0.755 (0.796, 0.749) 



Articles of the same site have the same format. The data sets in Table 1 contain 
different formats since the documents are collected from different sites. FindOptimal 
detects the different formats as useless parts with high accuracy. In particular, it 
detects the useless parts of articles in SANKEI despite its small size. 

4.4 Discussion 

Many text mining algorithms assume that frequent patterns, substrings, rules, 
etc. are important. In this paper, however, a frequent n-gram is defined to be 
useless. FindOptimal avoids treating an important keyword as a useless substring 
in the following way. FindOptimal increases the length n and the frequency a from 
the initial cut point (2, 1). Therefore, a frequent n-gram turns out to be long. In 
fact, n = 24 when WPOST is the input. We think that an important keyword 
is not so long and that a substring for the structure and style is relatively long. 
Even if some important keywords are long, it rarely happens that they appear 
frequently. Thus, the algorithm does not judge an important keyword to be 
useless. 

This simple idea sometimes does not work well if too few input documents
are given. For example, in the YOMIURI data set, there are some
news files on Mr. Tsuta’s death. He was a famous high-school baseball manager.
In these news files, the substring “a former manager of the baseball club at
Tokushima prefectural high-school” appears frequently. In Japanese, the length
of the substring is 11, and the length of the cut point found by FindOptimal is 8.
So, FindOptimal treats the substring as useless although the substring does not
belong to the structure or style of the documents. This type of error does not
happen when enough articles are given to FindOptimal because the frequency of
such a long substring becomes relatively low.

Note that useless parts do not consist of only tag sequences. For example, 
articles in YOMIURI have the same string in <title> tag and the string is 
detected as a useless part. On the other hand, the title of an article in WPOST 
is the same as the headline of the article and the title is detected as a non-useless 
part. 

5 Conclusion 

We introduced a new statistic, the alternation count, and developed an algorithm
that divides each of the given documents into two parts, useless parts and
non-useless parts. It is based on the simple assumption that frequent n-grams are
not useful. We defined an optimal cut point to decide an appropriate pair (n, a)
of length n and frequency a.

The algorithm does not depend on particular natural or markup languages because
it counts substrings of the inputs instead of words or other grammatical units.
Experimental results show that the algorithm is robust to noise. Moreover, if the
input documents are collected from different sites and have different formats, the
output cut point divides the documents of both sites into useless and non-useless
parts with high accuracy.

We only showed experiments on news articles provided as HTML files. It is
future work to use other types of data sets such as dynamic pages, static pages
other than news, files written in other markup languages, etc.

It is an interesting future work to use some knowledge of the grammar of the
markup language. For example, when the algorithm enumerates n-grams, a
delimiter such as “<” or “>” contained in an n-gram could be required to appear
only at the beginning or end of the n-gram. This may improve accuracy.

As an application of the algorithm, we developed a record extraction system,
SCOOP [16]. Record extraction is an important application of Web mining [6,
9,12]. SCOOP uses FindOptimal as a preprocessor, together with the knowledge
that a delimiter of a record (or field) begins with “<” and ends with “>”. Given an
output of FindOptimal, SCOOP searches for delimiters only near the boundaries of
non-useless parts and outputs the most frequent pair of substrings as a delimiter
if it is unique on each record.

Another challenging future work is to apply our algorithm to genome
informatics. The longest common subsequence problem is, given two strings, to
find a longest common subsequence of them [2,8]. The problem for k (k > 2)
strings is known as multiple sequence alignment, which is a major problem in
genome informatics [11]. If k is not fixed, multiple sequence alignment is known
to be NP-complete [13]. Both multiple sequence alignment and our problem aim
to extract parts common to a given set of strings. A difference between the two
problems is that each common part should have a certain length in our setting;
that is, our algorithm does not work well if the common parts are too short.

Abstract. Inference to the best explanation (IBE), or abduction, requires
finding the best explanatory hypothesis, from a set of rival hypotheses,
to explain a collection of data. The notion of best, however, is
multicriterial and the available rival hypotheses might be variously good
according to different criteria. Thus, one can view the abduction problem
as that of choosing the best hypothesis from among a set of multicriterially
evaluated hypotheses, i.e., as a multiple criteria decision making (MCDM)
problem. In the absence of a single hypothesis that is the best along all
dimensions of goodness, the MCDM problem becomes especially hard. 
The Seeker-Filter-Viewer architecture provides an effective and natural 
way to use computer power to assist humans to solve certain classes 
of MCDM problems. In this paper, we apply an MCDM perspective to 
the abductive problem of red-cell antibody identification and present the 
results obtained by using the S-F-V architecture. 



1 Introduction 

Abductive inference is a ubiquitous form of reasoning in science and common 
sense. Abduction has been referred to as inference to the best explanation 
by Harman [3] and as the explanatory inference by Lycan [4]. Typically the 
available evidence is insufficient to narrow conclusively to single explanations. 
So, multiple hypotheses are available and the problem becomes one of choosing 
the best among rivals. Josephson & Josephson [2] have described abductions as 
following this pattern: 

D is a collection of data (facts, observations, givens) 

H explains D (would, if true, explain D) 

No other hypothesis can explain D as well as H does. 



Therefore, H is probably true. 

They also suggest that the judgment of likelihood associated with a conclu- 
sion should depend upon a number of considerations. Apart from how good a 
single hypothesis is by itself, it is also desirable that it decisively surpass the 



K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 128—140, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001

alternative hypotheses. However, there are, in general, multiple kinds of criteria
by which hypotheses may be compared. Explanatory power and plausibility are
examples. Thus, we may view abduction as requiring a choice among
multicriterially evaluated hypotheses, that is, as a species of multiple criteria
decision making (MCDM).

MCDM problems have been widely studied across diverse fields and many 
techniques abound for solving MCDM problems [5]. An important concept in 
MCDM is the idea of dominance. Dominance is very much like an all-other- 
things-being-equal kind of reasoning. Specifically, we say that some multicrite- 
rially evaluated alternative A dominates another alternative B if there is some 
criterion in which A is strictly better than B and there is no criterion in which 
B is strictly better than A. An alternative that is not dominated is called a 
Pareto Optimal alternative. For a given problem, the set of Pareto Optimal al- 
ternatives has the property that, within the set, the only way to improve along 
any dimension is to accept a loss in another dimension. That is, choosing among 
the Pareto Optimal alternatives is a matter of making trade-offs. It is known 
that the size of the Pareto-optimal set is typically a very small percentage of 
the actual number of alternatives [6] [7]. Thus, the application of dominance as 
a filter can be expected to considerably reduce the number of alternatives which 
need to be considered [1]. 
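
The definition of dominance and the resulting Pareto-optimal set can be made concrete with a minimal sketch. It assumes that every criterion is numeric and that higher values are better; the function names and the sample points are illustrative and are not taken from [1].

```python
def dominates(a, b):
    """True if a dominates b: a is nowhere worse and somewhere strictly better.

    a and b are tuples of criterion values; higher is assumed to be better.
    """
    nowhere_worse = all(x >= y for x, y in zip(a, b))
    somewhere_better = any(x > y for x, y in zip(a, b))
    return nowhere_worse and somewhere_better


def pareto_optimal(alternatives):
    """Return the alternatives not dominated by any other (quadratic scan)."""
    return [a for a in alternatives
            if not any(dominates(b, a) for b in alternatives)]


# Hypothetical two-criterion evaluations.
candidates = [(3, 1), (2, 2), (1, 3), (1, 1), (2, 1)]
survivors = pareto_optimal(candidates)
print(survivors)
```

Here (1, 1) and (2, 1) are eliminated because (2, 2) is at least as good on both criteria and strictly better on one, while the three survivors can only be ranked by trading one criterion against the other.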

It is worth noting that there is no loss incurred in the elimination of the
dominated alternatives unless significant criteria have not been considered. This 
is because we know that for every alternative eliminated by the dominance filter, 
there is at least one Pareto-optimal alternative that dominates it and is therefore 
multicriterially better than it. The application of dominance minimally requires 
that an order relation hold among values for each criterion. The survivors of the 
dominance filter represent the multicriterially maximal subset of alternatives 
from the original set. 

From the definition of dominance it is clear that, for each pair of alternatives 
that survive the dominance filter, they outperform each other according to dif- 
ferent criteria. In other words, if alternatives A and B are in the Pareto Optimal 
set, then it must be the case that there is at least one criterion in which A is 
better than B and that there is at least one criterion in which B is better than 
A, thereby preventing either from dominating the other. 

In an abduction problem, a more plausible hypothesis, $H_1$, might not explain
as much as a less plausible one, $H_2$. That is, $H_2$ is better according to explanatory
coverage while $H_1$ is better according to the criterion of plausibility. In such a
case, there is no obvious sense in which either $H_1$ or $H_2$ can be said to be
a distinctly better hypothesis. However, depending upon the need to explain
more, and upon the degree of confidence that is needed for the final choice,
a choice between $H_1$ and $H_2$ may become possible. The choice from among the
Pareto Optimal set requires that trade-offs be accepted between plausibility and 
explanatory coverage. This can be a challenge since such trade-off judgments are 
often a function of the specific values at hand. For example, a certain level of 
confidence or of explanatory power may be sufficient.

In summary, a general way to solve an MCDM problem is to apply the dominance
filter and then allow for choice from among the set of dominance survivors
by applying human trade-off judgments with respect to the various criteria. The
Seeker-Filter-Viewer architecture described in [1] is based on this strategy for
solving the MCDM problem. The Seeker is a module which generates applicable 
alternatives and produces evaluations for them according to the different crite- 
ria. The Filter uses the principle of dominance to produce the Pareto-optimal set 
from the generated and evaluated set of alternatives. It eliminates the distinctly 
suboptimal alternatives. The Viewer allows a human to express his trade-off 
judgments on the Pareto-optimal alternatives. The Viewer allows a user to view 
the candidate alternatives as points in graphs with the criteria as axes. If multiple 
criteria need to be considered, the Viewer will provide multiple interlinked 2-D 
plots and histograms. The human expresses preferences by selecting desirable 
regions in the graphs. The graphically selected points or regions are cross-linked 
across all the open plots so that a selection made on one plot shows the values 
of the selected alternatives according to the other criteria. 

Apart from explanatory coverage and plausibility, we will describe several 
other criteria that can generally be used to evaluate candidate hypotheses in 
abduction problems. These criteria may or may not apply depending upon the 
problem domain and other characteristics of the data. We will briefly describe
the S-F-V architecture. Then, both to illustrate the MCDM view of abduction
and to demonstrate the S-F-V architecture in use,
we will present the results of experiments in the domain of red cell antibody
identification as described in [2]. We will describe the antibody identification
problem as an abduction problem, and define the evaluation criteria used in the 
experiment. Finally, we will show the results of viewing this abduction problem 
as an MCDM problem and applying the S-F-V architecture to help solve the 
problem. 



2 The Seeker-Filter-Viewer Architecture 

The S-F-V architecture is described in detail in [1] and [9]. In this section, we pro- 
vide a brief overview of the architecture and its use in solving MCDM problems. 
Essentially, the architecture is composed of three modules, the Seeker, the Filter,
and the Viewer, each designed to perform a specific set of functions involved in 
solving the given MCDM problem. We next describe these components one at a 
time: 



2.1 The Seeker 

The Seeker is responsible for the generation of the choice alternatives for the 
MCDM problem. In case the choice alternatives are already present or supplied 
by the decision-maker, the Seeker makes these choice alternatives accessible to 
the Filter by reading them from the database. For problems where the
decision-maker cannot provide the choice alternatives himself, it is the function of the
Seeker to seek out the alternatives from whatever sources are available, in a form
that can be used by the Filter. Abstractly, this could be a search on the Internet 
looking for choice alternatives pertaining to the problem. The Seeker described 
in [1] is currently capable of generating choice alternatives as compositions of 
various components listed in a component library. The Seeker instantiates all 
possible choice alternatives that can be formed by some distinct composition of 
a set of components in the library. Having instantiated the choice alternative, it 
next makes use of simulation models to evaluate various property values for the 
choice alternatives. For example, for an instantiated car, the Seeker might run 
simulations to compute the mileage, cost, weight, top-speed and other properties 
related to cars, for which simulation models are available. At the end of the 
generation process, the Seeker produces a list of choice alternatives along with a 
set of {property-name, property-value} pairs for each alternative. It makes this 
list available to the Filter. 

2.2 The Filter 

The Filter is responsible for applying the dominance rule to the set of alternatives 
generated by the Seeker. In order to do this, the Filter expects the decision-maker 
to choose those properties of the choice alternatives which reflect the dimensions 
of outcomes that matter to him, and additionally the directions of goodness for 
the criteria. For example, if the decision-maker desires to buy a car that is cost- 
effective to him, he should choose cost and mileage as properties of interest to 
him. Once such a set of properties has been selected by the decision-maker, the
Filter uses these properties as criteria based on which to apply the dominance 
rule on the set of alternatives. Since the criteria values for the chosen criteria 
are already made available by the Seeker, the Filter makes use of these values 
to produce the Pareto-optimal set of alternatives. As mentioned earlier, this 
step is essential because alternatives not belonging to the Pareto-optimal set are 
known to be dominated by some Pareto-optimal alternative. As a result, there is 
no loss incurred in eliminating such alternatives. By doing so, the Filter spares
the decision-maker from having to even consider such alternatives, and thereby
from unintentionally selecting a suboptimal alternative. Finally, as indicated in [6], [7],
the Pareto-optimal set often tends to be a very small fraction of the original set. 
Hence the application of the Filter also reduces the size of the set of alternatives 
that further need to be considered. While the Filter can reduce the relevant set of 
alternatives from a large number to a small fraction, choosing an alternative even 
from a handful of Pareto-optimal alternatives can be a demanding task for the 
decision-maker. The next module of the architecture allows the decision-maker 
to graphically interact with the Pareto-optimal alternatives in various ways, in 
order to select the final choice alternative(s) of interest to him.
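
The Filter step for the car example, with the decision-maker choosing cost and mileage as criteria and stating a direction of goodness for each, might be sketched as follows. This is an illustration under assumptions, not the implementation described in [1]: the property values, the "min"/"max" encoding, and the function names are all hypothetical.

```python
def to_maximize(alternative, criteria):
    """Project onto the chosen criteria, flipping signs so higher is better."""
    return tuple(alternative[name] if direction == "max" else -alternative[name]
                 for name, direction in criteria.items())


def dominance_filter(alternatives, criteria):
    """Keep only alternatives that no other alternative dominates."""
    scored = [(alt, to_maximize(alt, criteria)) for alt in alternatives]

    def dominated(v):
        # w dominates v if w is nowhere worse and differs from v somewhere.
        return any(all(w_i >= v_i for w_i, v_i in zip(w, v)) and w != v
                   for _, w in scored)

    return [alt for alt, v in scored if not dominated(v)]


# Hypothetical Seeker output: property-name/property-value pairs per car.
cars = [
    {"name": "A", "cost": 20000, "mileage": 30, "top_speed": 110},
    {"name": "B", "cost": 25000, "mileage": 40, "top_speed": 105},
    {"name": "C", "cost": 26000, "mileage": 35, "top_speed": 130},
    {"name": "D", "cost": 20000, "mileage": 28, "top_speed": 120},
]
# Criteria chosen by the decision-maker, each with its direction of goodness.
criteria = {"cost": "min", "mileage": "max"}

survivors = dominance_filter(cars, criteria)
print([car["name"] for car in survivors])
```

Properties that were not selected as criteria, such as top speed here, play no role in the filtering; choosing them as additional criteria could change which alternatives survive.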

2.3 The Viewer 

As mentioned previously, choice among Pareto-optimal alternatives requires the 
making of tradeoffs. The Viewer allows the decision-maker to interact with the
Pareto-optimal set by means of various kinds of graphical plots which enable the
decision-maker to express his tradeoff preferences in the context of the available
alternatives. A more detailed description of all modes of interaction that the
Viewer allows, along with a description of an interaction session between the
Viewer and a decision-maker, is provided in [9]. Here we will only mention that
the Viewer allows the decision-maker to plot the Pareto-optimal alternatives as 
points in 2-D scatter plots where the axes of the plots can be selected by the 
decision-maker himself; he can further pull up as many plots as he desires. The 
Viewer also maintains a set of 1-D plots where the Pareto-optimal alternatives 
are plotted along single property-axes. The Viewer allows the decision-maker 
to select points or collections of points by enabling graphical selection of such 
points. Upon selection by the decision-maker, all points within the selected region 
are indicated using a separate color and moreover such indication is provided
across all the open plots. Thus, even though the decision-maker makes his
selection on a single plot, he gets to examine the implications of his selection in 
terms of the other properties by examining the colored points on all other plots. 
This forces the decision-maker to make selections and at the same time evaluate 
the consequences of the selection. It is expected that this will lead to a more 
rational selection process. Apart from making tradeoffs, the Viewer enables other 
kinds of preference expression by the decision-maker. These include: choosing 
alternatives by categories from bar-charts, applying hard-constraints based on 
criteria by using the 1-D plots, applying various kinds of constraints based on 
as yet unconsidered properties, combining alternatives that belong to different 
Viewer-based selections of the decision-maker, looking at a list of all properties of 
alternatives in the selected region in a tabular form, and so on. Thus, the Viewer 
complements the Filter by enabling many kinds of preferences that apply when 
choosing from Pareto-optimal alternatives. 
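
The cross-linked selection described above can be sketched without any graphics: a rectangular region selected on one 2-D plot determines a subset of alternatives, and that same subset is then reported along the axes of the other open plots. The data, the property names, and both function names below are hypothetical and only illustrate the idea; the actual Viewer is described in [9].

```python
def select_region(alternatives, x_axis, x_range, y_axis, y_range):
    """Alternatives whose (x, y) point lies inside the selected rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [a for a in alternatives
            if x_lo <= a[x_axis] <= x_hi and y_lo <= a[y_axis] <= y_hi]


def linked_view(selected, other_axes):
    """Values of the selected alternatives along the axes of other plots."""
    return {axis: sorted(a[axis] for a in selected) for axis in other_axes}


# Hypothetical Pareto-optimal alternatives with three properties each.
cars = [
    {"name": "A", "cost": 20000, "mileage": 30, "weight": 1200},
    {"name": "B", "cost": 25000, "mileage": 40, "weight": 1500},
    {"name": "C", "cost": 32000, "mileage": 45, "weight": 1400},
]

# Selection made on the cost/mileage plot ...
picked = select_region(cars, "cost", (18000, 26000), "mileage", (25, 45))
# ... and its consequences as they would appear on a weight plot.
consequences = linked_view(picked, ["weight"])
print([c["name"] for c in picked], consequences)
```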

The synergy between the three modules of the S-F-V architecture provides 
it with the ability to act as an effective decision support for solving MCDM 
problems. As an indication of its effectiveness, we point to the experiment de- 
scribed in [1] where close to 2 million choice alternatives (Hybrid vehicles) were 
generated by the Seeker, the Filter reduced this set to 1078 alternatives, and in- 
teraction between a decision-maker and these Filter survivors using the Viewer 
resulted in a final output of 7 alternatives. The architecture has been applied to 
a number of engineering problems and our claims about the effectiveness of the 
architecture are based on the response we received from the users regarding the 
ease with which they were able to use the architecture. We realize that a formal 
usability analysis of the architecture would go a long way towards establishing 
this. We have compared the Viewer with a few other alternative visualization 
techniques in MCDM literature and our impression is that the Viewer has its 
own set of unique properties. We direct the interested reader to a survey of such 
visualization techniques that occurs in [10] (pp. 238-249). 

This brings our description of the S-F-V architecture to a close. We next 
describe some properties of explanatory hypotheses, which can be used as evaluation
criteria for the hypotheses. The use of such criteria to evaluate hypotheses
will allow hypotheses to be viewed as multicriterially evaluated alternatives,
thereby allowing the problem of choosing the “multicriterially best” alternative 
to be seen as an MCDM problem. 

3 Evaluation Criteria for Explanatory Hypotheses 

As we said, the idea of the best hypothesis from among a set of hypotheses 
is a multicriterial notion. In [8] the following qualities are suggested as criteria 
for evaluating hypotheses: Explanatory Power, Plausibility, Internal Consistency,
Simplicity, Specificity, Predictive Power, and Theoretical Promise. In order to
apply MCDM techniques it will be necessary that the evaluations according to 
the criteria can be obtained in a numerical form, or some other form that enables 
the comparison of criterion values, so that it is conducive to the application 
of MCDM techniques. This may well depend upon the domain for which the 
abduction problem is being solved. As an illustration of how this can be done, 
we next describe how evaluations were produced for the hypotheses in red cell 
antibody identification domain. 



4 The RED Domain: The Red Cell Antibody 
Identification Task 

As described in [2], the RED systems are medical test-interpretation systems 
that operate in the knowledge domain of hospital blood banks. Specifically, the 
RED systems are meant to help in the problem of red-cell antibody identification. 
We will first briefly describe the problem and then formulate the problem as an 
abduction problem. 



4.1 The Problem 

Before blood transfusion is carried out it is imperative to check that the donor’s 
blood matches the patient’s blood. The process of matching involves ensuring 
that the donor’s blood does not contain antigens which would be identified as 
foreign bodies by the patient’s immune system. If the immune system does en- 
counter foreign bodies, it produces antibodies directed against them. The an- 
tibodies that are produced by the patient’s blood against red cell antigens of 
a donor are called red-cell antibodies. If the patient’s blood contains antibod- 
ies directed against the red cell antigens of the donor’s blood, this is a case of 
mismatch. Transfusion of badly matched blood could result in many bad conse- 
quences including fever, anemia, and life threatening kidney-failure. Hence the 
red cell antibody identification task is of crucial importance to blood banks. In 
addition to the familiar A, B, and Rh, more than 400 red-cell antigens are known. 
Once the blood has been tested to determine the patient’s A-B-O and Rh blood
type, it is necessary to test for the presence of antibodies directed toward other
red-cell antigens.

Table 1. Red-cell test panel. The various test conditions, or phases, are listed along the
left side (AlbuminIS, etc.) and identifiers for donors of the red cells are given across the
top (623A, etc.). Entries in the table record reactions graded from 0, for no reaction, to
4+ for the strongest agglutination reaction, or H for hemolysis. Intermediate grades of
agglutination are +/- (a trace of reaction), 1+w (a grade of 1+, but with the modifier
“weak”), 1+, 1+s (the modifier means “strong”), 2+w, 2+, 2+s, 3+w, 3+, 3+s, 4+w.
Thus, cell 623A has a 3+ agglutination reaction in the Coombs phase.

Typically this identification is performed by using one or more reaction panels
of the form shown in Table 1. The columns in the table refer to different
applicable donors, while the rows refer to different test conditions. Each entry
in the table indicates the reaction shown by a mixed sample of the patient’s blood
serum and the indicated donor’s red blood cells, under the specified test conditions.
These entries are produced by the blood bank technologist to indicate his
visual assessment of the strength and type of reaction. Possible reaction types
are agglutination (clumping of cells) or hemolysis (splitting of the cell walls). The
strength of a reaction is expressed in the blood-banker’s vocabulary, some
terms of which are shown in Table 1, and which consists of thirteen possible
reaction strengths. Hemolysis reactions were ignored for purposes of this experiment.
All 3+ entries are converted into the number 3 for our experiment. Similarly, the
1+ values are converted to the number 1 and so on. Reactions indicated as 2+s are
converted into the number 2.5 while those marked as 2+w are converted into
the number 1.5.
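
The conversion of reaction grades into numbers can be sketched as below. The rule that “s” adds 0.5 and “w” subtracts 0.5 is a generalization from the examples given above (2+s to 2.5, 2+w to 1.5). The value used for a trace reaction (+/-) is an assumption, since the text does not state it, and the function name is illustrative.

```python
def grade_to_number(grade):
    """Convert a blood-bank agglutination grade such as '2+s' to a number.

    Returns None for hemolysis ('H'), which the experiment ignores.
    """
    g = grade.replace(" ", "")
    if g == "0":
        return 0.0
    if g == "H":
        return None
    if g == "+/-":
        return 0.5  # assumption: the text gives no numeric value for a trace
    base, modifier = g.split("+")
    offset = {"": 0.0, "s": 0.5, "w": -0.5}[modifier]
    return int(base) + offset


for grade in ["0", "1+", "2+w", "2+s", "3+", "H"]:
    print(grade, "->", grade_to_number(grade))
```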

Additionally, information about the significant antigens present in each of 
the donor samples are recorded in a table called the antigram. By reasoning 
about the pattern of reactions displayed by the reaction panel and using the 
antigen information present in the donor antigram, the blood-bank technologist 
attempts to determine which antibodies are present in the patient’s serum and 
are causing the observed reactions and which are absent, or at least not present 
in enough strength to cause reaction. The RED systems were built to automate 
this reasoning process. 

4.2 The Red Cell Antibody Identification Problem as an Abduction 
Problem 

The reaction panel shown in Table 1 can be considered as data to be explained. 
Using the antigrams which give information about the various antigens present 
in the donor samples, it is possible to construct hypotheses about the existence
of various antibodies in the patient’s serum. Each such hypothesis will contain
two kinds of information:

— A profile similar to Table 1 representing how much this particular hypothesis 
can offer to explain for each of the reactions in the panel. This is the most 
that can be consistently explained by the hypothesis. 

— A plausibility value, which is the result of applying rules given by domain 
experts, to the data of the case. In our experiment, this value is an integer
between -3 and +3, representing the plausibility on a symbolic scale from
“ruled out” to “highly plausible”.

An example for a certain antibody is given in Table 2. It shows how much of the 
reactions shown for the case from Table 1 are accounted for by hypothesizing 
that the antibody, AntiNMixed, is present in the patient’s serum. Also, the 
plausibility value for the hypothesis is indicated to be -2. 



Table 2. Reaction profile for an individual antibody (Anti NMixed) hypothesis. Note
by comparison with the overall reaction panel in Table 1 that the hypothesis only offers
to partially explain some of the reactions.

Table 2 does not contain as many columns as Table 1 because the hypothesis
cannot explain any of the reactions pertaining to those columns. The same kind
of profile is created for all of the other antibodies that are not ruled out. Hence,
given a donor, the following inputs are present:

1. The reaction panel as indicated in Table 1 

2. A plausibility value for each antibody. 

3. A reaction profile for each antibody for which the plausibility value is not 
-3, i.e. it has not been “ruled out.” 

The desired output will be a set of antibodies which best explain the reac- 
tions, along with plausibility values associated with them. The above problem 
can now be seen as an abduction problem with the following mapping: 

1. The reaction panel represents the data, D, to be explained. 

2. The individual antibodies which have not been “ruled out” and all possible 
composite hypotheses that can be generated from them represent the set of 
possible explanatory hypotheses, the set E.

The abduction problem is one of finding the hypothesis which best explains the
reactions in the reaction panel. However, sometimes the evidence will be insufficient
and there will be no unique, best explanation.

5 Evaluation Criteria for Hypotheses in the RED Domain 

In this section, we will describe how the evaluation criteria for the hypotheses in 
the RED domain were computed from the given information. For a given prob- 
lem, the set E of all possible explanatory hypotheses was created as described 
next. 

Firstly, the antibodies which are ruled out (i.e., those with plausibility value
-3) are no longer considered as potentially explanatory hypotheses for the problem.
Such hypotheses are excluded from the set E. The set, S, of simple hypotheses
may be defined as follows:

$$S = \{ A_i : A_i \text{ hypothesizes the presence of a particular antibody and the plausibility of } A_i \text{ is not } -3 \}$$

Now, the set E of applicable hypotheses is defined as the set of all possible
hypotheses obtainable as combinations (conjunctions) of the simple hypotheses
in S.

The set $C = E - S$ is therefore the set of all composite hypotheses, which
hypothesize the presence of more than a single antibody to explain the reactions.
Thus, if we suppose $A_1, A_2, A_3, \ldots, A_k$ to be the individual hypotheses
related to the presence of single antibodies which have not been ruled out in
advance, then the hypothesis $A_4$ is an example of a simple hypothesis, while a
hypothesis that combines three of them, such as $A_2$ and $A_3$ together with one
further simple hypothesis, is an example of a 3-part composite hypothesis. Obviously,
the size of the largest composite hypothesis is k and it includes all of the
simple hypotheses in the set S as its parts. Next, we will discuss how some of
the criteria mentioned in Section 3 were computed for the set E above.

1. Explanatory Power: Given the values in the reaction panel and the reaction
profiles, $R_i$, for simple hypotheses, $A_i$, one way to quantify the explanatory
power of a simple hypothesis is to compute the sum of all the values in 
its reaction profile table. Since each individual entry in the reaction profile 
offers to explain an observed reaction as consistently as possible, the sum 
of the reaction profile matrix is indicative of the overall explanatory power 
of the simple hypothesis. This value is used as a heuristic measure of the 
explanatory power of the simple hypothesis. For a composite hypothesis, the 
reaction profile is constructed by using the profiles for its parts. That is, the 
reaction profile for a composite hypothesis is constructed as the entry-wise 
sum of reactions in the individual reaction profiles, with a maximum for any 
entry of the reaction strength that needs to be explained. More formally, the
Explanatory Power, $\mathcal{S}$, was computed as

$$\forall H \in E, \quad \mathcal{S}(H) = \sum_{a,b} R(a,b) \qquad (1)$$

where the sum ranges over all entries $R(a,b)$ of the reaction profile of H.

2. Implausibility: The plausibility, $p_i$, for each of the simple hypotheses, $A_i$,
is already available as a part of the input. Since -3 is the lowest degree of
plausibility that is assigned to a simple hypothesis, the implausibility can be
computed by a heuristic measure which produces low values of implausibility
for high values of plausibility, and vice versa. The exact form of this function is
not important since the values themselves are meant to be used only for
relative comparisons. In our experiment, implausibility, I, was computed as

$$\forall H \in E, \quad I(H) = \sum_{A_j \in H} (4 - p_j)$$


3. Simplicity: There are at least two ways to define this criterion. 

a) Cardinality: Simplicity can be defined in terms of the number of parts
in a hypothesis, that is, its cardinality. Note that we would want to minimize
this value in order to maximize simplicity. However, this measure
provides a better score for a two-part hypothesis containing $A_2$ than for
a three-part hypothesis containing $A_1$ and $A_3$, based merely on the
difference in their structural simplicities.

b) Inclusion simplicity: This measure cannot be quantified on a per-hypothesis
basis like the previous ones. However, when comparing two composite
hypotheses, say $H_1$ and $H_2$, we say that $H_1$ is better in
inclusion-simplicity than $H_2$ if and only if all of the constituent parts of $H_1$ are
present in $H_2$ as well. In all other cases, the two hypotheses are considered
incomparable in simplicity. This measure makes sure that the least
complex hypothesis is preferred to a more complex one that explains no 
more. 
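
The two heuristic measures used in the experiment can be sketched as follows. A reaction profile is represented here as a dictionary from (phase, cell) pairs to numeric reaction strengths; this representation, the cell names, and all numbers are hypothetical, while the capped entry-wise sum and the term 4 - p_j follow the definitions above.

```python
def composite_profile(profiles, observed):
    """Entry-wise sum of the part profiles, capped at the observed reaction."""
    return {cell: min(sum(p.get(cell, 0) for p in profiles), strength)
            for cell, strength in observed.items()}


def explanatory_power(profiles, observed):
    """Sum of all entries of the (composite) reaction profile."""
    return sum(composite_profile(profiles, observed).values())


def implausibility(plausibilities):
    """I(H): sum of (4 - p_j) over the parts, each p_j between -3 and +3."""
    return sum(4 - p for p in plausibilities)


# Hypothetical panel: observed reaction strengths to be explained.
observed = {("Coombs", "cell1"): 3, ("Coombs", "cell2"): 2,
            ("AlbuminIS", "cell1"): 1}
# Hypothetical reaction profiles of two simple hypotheses.
profile_1 = {("Coombs", "cell1"): 2, ("Coombs", "cell2"): 1}
profile_2 = {("Coombs", "cell1"): 2, ("AlbuminIS", "cell1"): 1}

single = explanatory_power([profile_1], observed)
combined = explanatory_power([profile_1, profile_2], observed)
print(single, combined, implausibility([2]), implausibility([2, -2]))
```

In this example the combined hypothesis explains more than either part alone, but the cap keeps it from claiming more than the observed reaction of 3 on the first cell, and its implausibility is higher than that of either part.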

For the RED domain experiment, only two of the above criteria were used. This
is because the implausibility value, as defined previously, already carries the
information carried by the inclusion-simplicity criterion: adding another
hypothesis to a composite always reduces its plausibility, so an included
hypothesis is always more plausible than the including one. Consequently,
if $k$ simple hypotheses are not ruled out in advance, then the abduction problem
involves as many hypotheses as there are non-empty combinations of the
$k$ simple hypotheses. In other words, the problem becomes a $(2^k - 1)$-alternative,
2-criteria MCDM problem. In the next section, we discuss the results of applying
the S-F-V architecture to this MCDM problem.
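
The size of the hypothesis space and the implausibility measure can be
illustrated with a minimal sketch. The function names and the plausibility
scores below are hypothetical and serve only as an illustration; they are not
taken from the RED system.

```python
from itertools import combinations


def composite_hypotheses(simple):
    """Yield every non-empty subset of the simple hypotheses as a frozenset."""
    for size in range(1, len(simple) + 1):
        for combo in combinations(simple, size):
            yield frozenset(combo)


def implausibility(hypothesis, plausibility):
    """I(H) = sum over A_j in H of (4 - p_j), following the formula in the text."""
    return sum(4 - plausibility[a] for a in hypothesis)


# Hypothetical plausibility scores for three simple hypotheses.
plausibility = {"A1": 3, "A2": 1, "A3": -3}
composites = list(composite_hypotheses(list(plausibility)))

print(len(composites))  # 2**3 - 1 = 7
print(implausibility(frozenset({"A1", "A3"}), plausibility))  # (4-3) + (4+3) = 8
```

Since every added simple hypothesis contributes a positive term, a composite
always scores a higher implausibility than any composite it includes, which is
why the inclusion-simplicity criterion adds no further information here.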



6 Results of Applying the S-F-V Architecture 

It is to be noted that the potential number of explanatory hypotheses in the RED
domain is exponential in the number of simple hypotheses that have not been
ruled out at the outset. Considering that up to 30 clinically significant red-cell
antigens are known, the total number of alternatives is potentially quite large.
Hence, the ability of the dominance filter to prune effectively becomes valuable in
reducing the complexity of the problem. The results shown below are for the
case labeled OSU-9 in the RED domain as described in [2]. The reaction panel
shown in Table 1 refers to the same case.

This case resulted in 15 simple hypotheses which could not be ruled out
based on the evidence at the outset. Thus we have a total of 32,767
($= 2^{15} - 1$) potential explanatory hypotheses, a formidable number. The
Seeker generates this set by building the exhaustive set of combinations
starting with the 15 simple hypotheses. In the process of generation, the Seeker
also evaluates the hypotheses along the various criteria using the heuristic
measures indicated in the previous section. It then makes this set of
multicriterially evaluated hypotheses available to the Filter. The Filter, after
applying the dominance rule using implausibility and explanatory power as the
criteria, produces a Pareto-optimal set containing only 3 hypotheses! In other
words, as long as the goal is to find the most plausible hypothesis which
explains the reactions the best, there is no need to consider the remaining
32,764 eliminated alternatives; dominance ensures that they are inferior to the
survivors. The 3 surviving hypotheses are plotted as points in a Viewer scatter
plot (Fig. 1) with Implausibility and Explanatory Power as the axes. The labels
for each point show the composition of the individual hypotheses. We see that
of the 3 survivors, one is a simple hypothesis, and in fact this hypothesis, A8,
occurs in each of the remaining two composite hypotheses, {A8, A5} and
{A8, A12}.

Fig. 1. Plot showing the 3 survivors of dominance applied to the case OSU-9 from the
RED domain
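
The pruning step performed by the Filter can be sketched as follows. This is
only an illustration under stated assumptions: it uses the usual Pareto
dominance rule (no worse on every criterion, strictly better on at least one),
and the scores are invented for the example rather than taken from case OSU-9.

```python
def dominates(x, y):
    """Return True if x dominates y.

    Each alternative is a pair (implausibility, explanatory_power):
    lower implausibility is better, higher explanatory power is better.
    """
    no_worse = x[0] <= y[0] and x[1] >= y[1]
    strictly_better = x[0] < y[0] or x[1] > y[1]
    return no_worse and strictly_better


def pareto_filter(alternatives):
    """Return the names of the alternatives that no other alternative dominates."""
    return [
        name
        for name, score in alternatives.items()
        if not any(
            dominates(other, score)
            for other_name, other in alternatives.items()
            if other_name != name
        )
    ]


# Hypothetical scores, chosen only to illustrate the filter.
scores = {
    "{A8}": (2, 10),
    "{A8, A12}": (5, 14),
    "{A8, A5}": (9, 15),
    "{A3}": (4, 9),        # dominated by {A8}
    "{A3, A12}": (8, 13),  # dominated by {A8, A12}
}
survivors = pareto_filter(scores)
print(survivors)
```

The pairwise comparison is quadratic in the number of alternatives, which is
acceptable for a sketch; the survivors form the trade-off curve that the Viewer
would display.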

Figure 1 also shows the trade-offs available to the user of such a system. Such
a trade-off is typical of many abduction problems, where the ability to explain
more comes at a cost in the confidence associated with the explanation. By
using this plot, which is displayed by the Viewer in the S-F-V architecture,
the user can exercise trade-off judgments by selecting the point of interest.
For example, to get greater explanatory coverage than that provided
by the simple hypothesis, the user learns from the plot that an increase in
implausibility must be accepted. The composite {A8, A12}, shown as
the middle point in the plot, allows for one step of trade-off in the direction
of explaining more, with a resulting increase in implausibility. Similarly,
the point to the extreme right and top explains the most but is also the most
implausible among the three potential explanatory hypotheses. Figure 2 shows
similar trade-offs for another experimental case. This plot shows more clearly
how moving from the leftmost point to the next point results in a considerable
increase in explanatory coverage while the resultant increase in implausibility
is not as large. Conversely, looking at the rightmost pair of points, we see that
a very small increase in explanatory coverage is obtained by incurring quite
a large increase in implausibility. This illustrates how different kinds of
trade-off judgments can be brought to bear in choosing between competing
hypotheses even if they are both Pareto-optimal. This plot also shows how the
choice between multicriterially best explanations involves trade-offs. The choice
of an appropriate hypothesis will depend upon the user's (in this case the person
administering the blood) willingness to hypothesize the presence or absence of
an antibody according to the urgency of the situation and other risk-based
considerations. Alternatively, if additional knowledge becomes available at a
later stage of the problem, it may be used to rule out some of the surviving
hypotheses.

Fig. 2. Plot showing the survivors of dominance applied to the case Pat-32 from the
RED domain



7 Conclusions

We have shown how the MCDM perspective applies to abductive reasoning.
IBE problems are inherently multicriterial. These criteria need not be
commensurable. Even when they are not, a well-defined notion of multicriterially
best explanations can be given. Such best explanations need not be unique.
However, computer-aided visualization of the alternatives can help a human choose
from among the multicriterially best hypotheses. It is worth noting that if there
is indeed a single hypothesis that is the most plausible, explains the most, and
so on, then such a hypothesis will be the sole survivor of the dominance filter
(because, by virtue of being the best along all of the evaluation criteria, it
will dominate every other alternative, using the definition of dominance given
earlier). Moreover, MCDM techniques can help reduce the complexity of the problem.
One can envision scientists using powerful, computerized decision aids like the
S-F-V architecture in the future to help solve complex problems of discovery.



Abstract. In the study of discovering association rules, it is regarded as
an important task to reduce the number of generated rules without loss
of any information about the significant rules. From this point of view,
Bastide et al. have proposed to generate only non-redundant rules [2].
Although the number of generated rules can be reduced drastically by
taking the redundancy into account, many rules are often still generated.
In this paper, we propose a method for reducing the number of
generated rules by extending the original framework. For this purpose,
we introduce a notion of approximate generator and consider an
approximate redundancy. According to our new notion of redundancy,
many rules that are non-redundant in the original sense are judged redundant
and hidden from users. This achieves the reduction of generated rules.
Furthermore, it is shown that any redundant rule can be easily reconstructed
from our non-redundant rule with its approximate support and
confidence. The maximum errors of these values are bounded in terms of
a user-defined parameter. We present an algorithm for constructing a
set of non-redundant rules, called an approximate informative basis. The
completeness and weak-soundness of the basis are theoretically shown:
any significant rule can be reconstructed from the basis, and any rule
reconstructed from the basis is (approximately) significant. Some
experimental results show the effectiveness of our method as well.



1 Introduction 

The discovery of association rules is an important task in the research area of
Data Mining. Its main purpose is to identify relationships among items in a
given large database. This kind of problem was first introduced by Agrawal
et al. [1]. According to their statement, the problem can be divided into two
sub-problems:

Finding frequent itemsets:
    Given a transaction database $\mathcal{D}$, we try to find all frequent
    itemsets¹ in $\mathcal{D}$.

Generating confident association rules:
    All confident association rules are generated based on the frequent itemsets.

¹ An itemset is a set of items appearing in $\mathcal{D}$.



K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 141-154, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

In order to solve the former problem, we would be required to search in an
itemset-lattice consisting of $2^m$ itemsets if we have $m$ possible items. On the
other hand, the latter problem can be solved in a straightforward manner, once
we have all frequent itemsets. Therefore, the former is considered primary and
the latter secondary in an efficient discovery of association rules. In fact, many
studies on association rule discovery have tended to concentrate on an efficient
computation of the frequent itemsets, and many algorithms for this task have
been proposed [1, 4, 5].

Thus, as many researchers have actually investigated, the task of finding all 
frequent itemsets is one of the important subjects in the discovery of associa- 
tion rules. However, we still have another significant issue to be addressed. It 
is concerned with the number of rules generated from the obtained frequent 
itemsets. 

In general, a large number of rules are generated and then presented to a
user. Although it is ensured that the generated rules meet the requirements for
support and confidence given by the user², they often include many rules that
are in fact not so interesting to the user. Therefore, the user has to check each
presented rule carefully in order to obtain the actually interesting ones. However,
such a task is quite hard due to the large number of presented rules. In some
cases, unfortunately, several interesting rules might be missed. Therefore, it is
helpful for the user to reduce the number of generated (and presented) rules
without losing any information about the possible ones. The purpose of this paper
is to propose a method for such a reduction.

By introducing a notion of redundancy of association rules, Bastide et al.
have proposed to identify only the set of non-redundant ones, called an
informative basis, and to present the basis to the user. In a word, a
non-redundant rule can be viewed as a representative of a set of rules, each of
which has exactly the same support and confidence and can be easily reconstructed
from the representative. For example, assume we have the following association
rules:

$r_1 = i_1 \rightarrow i_2 \wedge i_3 \wedge i_4$, $r_2 = i_1 \wedge i_2 \rightarrow i_3 \wedge i_4$ and $r_3 = i_1 \wedge i_2 \wedge i_3 \rightarrow i_4$,

where their supports and confidences are exactly identical. Given $r_1$, the others
can be reconstructed from $r_1$ by a quite simple operation. Furthermore, their
precise supports and confidences can be obtained immediately. In this sense, $r_2$
and $r_3$ are considered to be redundant and $r_1$ to be non-redundant³. Identifying
non-redundant rules is just sufficient to obtain the possible ones. As has been
mentioned above, since a non-redundant rule corresponds to a representative
of a set of rules, the number of non-redundant rules is expected to be much
smaller than that of the possible rules. By considering only non-redundant rules,
therefore, we can drastically reduce the number of rules to be generated.

From the authors' viewpoint, however, there often still exist many
non-redundant rules. It might be a costly task for users to check them. Although
we can easily reconstruct any redundant rule from a non-redundant one with its
precise support and confidence according to the original framework, the authors
would like to claim that

    from a practical point of view, even though we cannot surely derive the
    precise support and confidence of a redundant rule, it would be worth
    reducing the number of output rules further.

² If a rule meets the requirements, we say that the rule is significant.

³ In a word, such a non-redundant rule is characterized as one with the minimal
antecedent and the maximal consequent.

In this paper, we propose a method for such a reduction by extending the
original approach. For this purpose, the original notion of redundancy
is extended according to the claim above.

Since the support and confidence of our redundant rule are only approximately
derived from a non-redundant one according to such an extended redundancy,
these approximate values might not satisfy users who require a high
precision of the derived values. In our framework, therefore, we can flexibly
adjust the maximum error by giving an adequate value of a user-defined parameter
$\varepsilon$ ($0 \le \varepsilon < 1$). As $\varepsilon$ approaches 1, the maximum error increases, but the number
of non-redundant rules decreases. Conversely, as $\varepsilon$ approaches 0, the maximum
error approaches 0, but the number of non-redundant rules increases.

Given a user-defined parameter $\varepsilon$, in order to describe our non-redundancy,
we define a set of rules w.r.t. $\varepsilon$, called an approximate informative basis
($AIB(\varepsilon)$). It will be proved that every rule $r$ in $AIB(\varepsilon)$ has the following
property:

    Any rule $r'$ reconstructable from $r$ has approximately the same support
    and confidence as those of $r$, where the maximum errors of these values
    are evaluated by some formulas determined by $\varepsilon$.

Thus a rule reconstructable from $r$ is redundant. For the same reason, such a
rule $r$ in $AIB(\varepsilon)$ is non-redundant, and can approximately represent any rule
reconstructable from it.

For any significant rule $r$, there always exists a corresponding non-redundant
rule in $AIB(\varepsilon)$ from which $r$ can be reconstructed. No significant rule can be
lost, once $AIB(\varepsilon)$ is computed. The completeness in this sense and the
weak-soundness of $AIB(\varepsilon)$ are summarized in a theorem. We present an algorithm for
constructing $AIB(\varepsilon)$. The effectiveness of our method is shown by some
experimental results.

This paper is organized as follows. In the next section, we introduce some
terminology used throughout this paper. In Section 3, we briefly explain the
original framework by Bastide et al. Section 4 discusses our method for
constructing $AIB(\varepsilon)$ with an example. Our preliminary experimental results are
presented in Section 5. We summarize this paper and give some discussion in
the last section. In particular, we briefly describe a new interactive strategy,
which we are going to develop, for identifying interesting rules based on the
method presented in this paper.

2 Preliminaries 

Let $\mathcal{I}$ be a finite set of items. An itemset $l$ is a non-empty subset of
$\mathcal{I}$. A tuple $(id, l)$ is called a transaction, where $id$ is a transaction
identifier and $l$ is an itemset. A transaction database $\mathcal{D}$ is a finite set of
transactions. We often refer to $itemset(id)$ as the itemset associated with $id$
in a transaction.

For a transaction $t = (id, l)$, we say that $t$ contains an itemset $l'$ if
$l' \subseteq l$. Given a transaction database $\mathcal{D}$, the support of an itemset $l$,
denoted by $sup(l)$, is defined as the ratio of the number of transactions
containing $l$ to the number of all transactions in $\mathcal{D}$. Let $minsup$ be a
user-defined threshold for the permissible minimum support. An itemset $l$ is
called a frequent itemset if $sup(l) \ge minsup$.

An association rule $r$ is an implication between two itemsets which is of the
form $r = l_1 \rightarrow (l_2 \setminus l_1)$, where $l_1$ and $l_2$ are itemsets such that
$l_1 \subset l_2$. The support of $r$, denoted by $sup(r)$, is defined as
$sup(r) = sup(l_2)$. Furthermore, the confidence of $r$, denoted by $conf(r)$, is
defined as $conf(r) = sup(l_2) / sup(l_1)$. Let $minconf$ be a user-defined
threshold for the permissible minimum confidence. An association rule $r$ is said
to be significant if $sup(r) \ge minsup$ and $conf(r) \ge minconf$.

Given a transaction database $\mathcal{D}$, let $\mathcal{ID}$ be the set of transaction
identifiers in $\mathcal{D}$. We consider a mapping $\varphi : 2^{\mathcal{I}} \rightarrow 2^{\mathcal{ID}}$ that is
defined as $\varphi(l) = \{ id \mid (id, l') \in \mathcal{D} \wedge l \subseteq l' \}$. Moreover, we consider
a mapping $\psi : 2^{\mathcal{ID}} \rightarrow 2^{\mathcal{I}}$ that is defined as
$\psi(T) = \bigcap_{id \in T} itemset(id)$. Based on these mappings, a closure operator
$\gamma : 2^{\mathcal{I}} \rightarrow 2^{\mathcal{I}}$ is defined as $\gamma(l) = \psi(\varphi(l))$, that is, $\gamma$ computes the
maximum itemset that is shared by all transactions containing $l$.

We say that an itemset $l$ is closed if $\gamma(l) = l$. Since $\gamma(\gamma(l)) = \gamma(l)$ holds
for any itemset $l$, $\gamma(l)$ is a closed itemset. It should be noted that for any
itemset $l'$ such that $l \subseteq l' \subseteq \gamma(l)$, $\gamma(l') = \gamma(l)$ and
$sup(l) = sup(l') = sup(\gamma(l))$ hold.

An itemset $l$ is called an exact generator (E-generator) of $\gamma(l)$. For a
frequent closed itemset $f$, we refer to the set of E-generators of $f$ as $EG(f)$
and the set of minimal E-generators of $f$ as $MEG(f)$, that is,
$MEG(f) = \{ g \mid g \in EG(f) \wedge \nexists g' \in EG(f) \text{ such that } g' \subset g \}$. For a
frequent closed itemset $f$ and its E-generator $g \in MEG(f)$, a tuple $(g, f)$ is
called an EGC-tuple. Given an EGC-tuple $(g, f)$, for any itemset $l$ such that
$g \subseteq l \subseteq f$, $sup(g) = sup(l) = sup(f)$ holds. The set of EGC-tuples w.r.t.
$\mathcal{D}$ is referred to as $EGC(\mathcal{D})$.
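
The definitions of this section can be made concrete with a small brute-force
sketch. It uses the six-transaction database that appears later in Fig. 2; the
function names (`phi`, `psi`, `closure`, `egc_tuples`) are our own choices for
illustration, and the exhaustive enumeration is only suitable for toy data, not
a substitute for a Close-like algorithm.

```python
from fractions import Fraction
from itertools import combinations

# Transaction database of Fig. 2: identifier -> itemset.
DB = {1: "acd", 2: "bce", 3: "abce", 4: "be", 5: "abcd", 6: "bce"}
DB = {tid: frozenset(items) for tid, items in DB.items()}
ITEMS = sorted(set().union(*DB.values()))


def phi(itemset):
    """Identifiers of the transactions that contain the itemset."""
    return {tid for tid, items in DB.items() if itemset <= items}


def psi(tids):
    """Items shared by all transactions in tids (tids must be non-empty)."""
    return frozenset.intersection(*(DB[t] for t in tids))


def sup(itemset):
    """Support as an exact fraction of the number of transactions."""
    return Fraction(len(phi(itemset)), len(DB))


def closure(itemset):
    """gamma(l) = psi(phi(l)); call only for itemsets with non-zero support."""
    return psi(phi(itemset))


def egc_tuples(minsup):
    """Brute-force EGC-tuples: (minimal E-generator, frequent closed itemset)."""
    frequent = [
        frozenset(c)
        for k in range(1, len(ITEMS) + 1)
        for c in combinations(ITEMS, k)
        if sup(frozenset(c)) >= minsup
    ]
    tuples = set()
    for g in frequent:
        f = closure(g)
        # g is minimal if no proper (non-empty) subset has the same closure.
        if not any(h < g and closure(h) == f for h in frequent):
            tuples.add((g, f))
    return tuples


egc = egc_tuples(Fraction(1, 6))
for g, f in sorted(egc, key=lambda t: (len(t[0]), sorted(t[0]))):
    print("".join(sorted(g)), "".join(sorted(f)), sup(f))
```

Exact fractions are used instead of floats so that threshold comparisons such
as $sup(l) \ge minsup$ are not affected by rounding.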

3 Informative Basis of Association Rules

In this section, we briefly introduce a method of reducing the number of gen- 
erated rules [2]. The key notion of this approach is a redundancy of association 
rule. 

Definition 1. (Redundancy of Association Rule) [2]

Let $r = l_1 \rightarrow (l_2 \setminus l_1)$ be an association rule. $r$ is called a redundant rule
iff there exists an association rule $r' = l_1' \rightarrow (l_2' \setminus l_1')$ such that
$l_1' \subseteq l_1$, $l_2 \subseteq l_2'$, $r' \ne r$, $sup(r') = sup(r)$ and $conf(r') = conf(r)$. ■

Intuitively speaking, a redundant rule $r$ is a rule which has exactly the same
support and confidence as those of some non-redundant rule $r'$ and can be
easily reconstructed from $r'$ by a simple operation on itemsets. Therefore, a
non-redundant rule can be viewed as a representative of a set of redundant ones.
This implies that extracting only non-redundant rules can be considered sufficient
for the discovery of all possible rules. Since the number of non-redundant rules
cannot exceed that of all rules, we can reduce the number of
rules to be obtained by simply taking non-redundant ones into account.

Each non-redundant rule is characterized as a rule with the minimal an- 
tecedent and maximal consequent and is formally defined in terms of E-generator 
and closure. It is shown that any rule can be reconstructed from a non-redundant 
rule with its precise support and confidence. 

Furthermore, some experimental results show that the number of non- 
redundant rules is much smaller than that of all possible rules. Therefore, the 
method can be considered effective and promising in order to reduce the number 
of rules to be generated. However, there often exist a large number of non- 
redundant rules even though all redundant ones are discarded. Since the task of 
checking them would be still costly for users, more reduction is strongly desired 
to assist the user’s task. 

In the next section, we propose a method for such a reduction by
extending the original approach.



4 Approximate Informative Basis of Association Rules 

As just mentioned, we still have a large number of rules even if we consider 
redundant ones to be unnecessary. Although we can easily reconstruct any re- 
dundant rule from a non-redundant one with its precise support and confidence 
according to the original framework, the authors would like to claim that 

from a practical point of view, even though we cannot precisely derive the 
supports and confidences of redundant rules, it would be worth reducing 
the number of output rules further. 

In this section, we propose a method for reducing the number of rules to
be generated. For this purpose, the original notion of redundancy is
extended according to the claim above. In order to present our method, we first
introduce a notion of approximate generator.

4.1 Approximate Generators of Closed Itemsets 

An approximate generator is an extension of E-generator and it can work more 
flexibly. 

Definition 2. (A-Generators)

Let $l$ be an itemset and $f$ a closed itemset. $l$ is called an approximate generator
(A-generator) of $f$ if $\gamma(l) \subseteq f$ and $sup(f) / sup(\gamma(l)) \ge 1 - \varepsilon$, where $\varepsilon$ is a
user-defined parameter ($0 \le \varepsilon < 1$). ■



Note that any E-generator of a closed itemset $f$ is an A-generator of $f$.
The following property plays a very important role in our method.

Proposition 1.

Let $g$ be an A-generator of a closed itemset $f$. For any itemset $l$ such that
$g \subseteq l \subseteq f$,

$$sup(g) \ge sup(l) \ge (1 - \varepsilon)\,sup(g)$$

and

$$sup(f) / (1 - \varepsilon) \ge sup(l) \ge sup(f).$$

Proof.

From the definition of A-generator, $1 \ge sup(f) / sup(\gamma(g)) \ge 1 - \varepsilon$ holds.
Since $sup(\gamma(g)) = sup(g)$, we have $sup(g) \ge sup(f) \ge (1 - \varepsilon)\,sup(g)$. From
$sup(g) \ge sup(l) \ge sup(f)$, therefore, $sup(g) \ge sup(l) \ge (1 - \varepsilon)\,sup(g)$ holds.

Based on the inequalities above, we can easily obtain
$sup(f) / (1 - \varepsilon) \ge sup(l) \ge sup(f)$ as well. □

The proposition implies that $sup(g)$ and $sup(f)$ can be considered as
approximations of $sup(l)$ if we can accept the errors. It should be noted here
that the maximum errors are precisely evaluated with the parameter $\varepsilon$. Therefore,
we can flexibly adjust the maximum errors so that they are permissible for us.
As $\varepsilon$ approaches 1, the maximum error becomes larger. Conversely, as $\varepsilon$
approaches 0, the maximum error approaches 0. That is, in the case of $\varepsilon = 0$,
any A-generator corresponds to an E-generator.



4.2 Approximation of Tuples 

As previously mentioned, for any itemset $l$, the support of $l$ can be precisely
identified with an EGC-tuple $(g, f)$ such that $g \subseteq l \subseteq f$, since
$sup(g) = sup(l) = sup(f)$. Therefore, based on the set of EGC-tuples w.r.t.
$\mathcal{D}$, $EGC(\mathcal{D})$, we can obtain the precise support of any itemset.

On the other hand, we define here an approximation of $EGC(\mathcal{D})$ with the
help of A-generators. Using the approximation, we can approximately identify
the support of any itemset with the maximum errors we just discussed.

Definition 3. (Approximation of $EGC(\mathcal{D})$)

Let $\mathcal{F}$ be the set of frequent closed itemsets w.r.t. $\mathcal{D}$ and $\varepsilon$ a user-defined
parameter ($0 \le \varepsilon < 1$). Consider a partition of $\mathcal{F}$, $\{F_1, \ldots, F_k\}$⁴. For each
$F_i$, there uniquely exists a closure $f_i^* \in F_i$ such that $\forall f \in F_i$,
$f \subseteq f_i^*$ and $sup(f_i^*) / sup(f) \ge 1 - \varepsilon$. For each $F_i$, let us consider

$$AGC(F_i) = \{ (g, f_i^*) \mid g \in min(\textstyle\bigcup_{f \in F_i} MEG(f)) \}.\text{⁵}$$

An approximation of $EGC(\mathcal{D})$ is defined as

$$AGC(\mathcal{D}, \varepsilon) = \bigcup_{i=1}^{k} AGC(F_i).$$

Each tuple in $AGC(\mathcal{D}, \varepsilon)$ is called an AGC-tuple. ■

⁴ That is, $\mathcal{F} = \bigcup_{i=1}^{k} F_i$ and $F_i \cap F_j = \emptyset$ ($i \ne j$), where each $F_i$ is
called a cell.

⁵ For a set $S$, $min(S)$ denotes the set of minimal elements in $S$ under the
set-inclusion ordering.

From the definition, for each EGC-tuple $(g, f) \in EGC(\mathcal{D})$, it is obvious
that $f$ uniquely belongs to some $F_i$ and there exists an AGC-tuple
$(g^*, f^*) \in AGC(\mathcal{D}, \varepsilon)$ such that $g^* \subseteq g$ and $f \subseteq f^*$. Moreover, for any
AGC-tuple $(g^*, f^*)$, $g^*$ is an A-generator of $f^*$. From these observations and
Proposition 1, therefore, we can obtain the following statement.

Proposition 2.

For any frequent itemset $l$, there exists an AGC-tuple $(g, f) \in AGC(\mathcal{D}, \varepsilon)$ such
that $g \subseteq l \subseteq f$. Furthermore, $sup(g) \ge sup(l) \ge (1 - \varepsilon)\,sup(g)$ and
$sup(f) / (1 - \varepsilon) \ge sup(l) \ge sup(f)$ hold.

Proposition 2 implies that $AGC(\mathcal{D}, \varepsilon)$ can identify the support of any frequent
itemset approximately, where the maximum errors are precisely evaluated by
functions of $\varepsilon$.
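
The bounds of Proposition 2 can be checked numerically on the example worked
out in Section 4.4. The sketch below hard-codes the database of Fig. 2 and the
AGC-tuples listed in that example for $\varepsilon = 0.7$; the helper name
`support_bounds` is hypothetical and introduced only for this illustration.

```python
from fractions import Fraction
from itertools import combinations

DB = [frozenset(t) for t in ("acd", "bce", "abce", "be", "abcd", "bce")]
EPS = Fraction(7, 10)
MINSUP = Fraction(1, 6)


def sup(itemset):
    """Exact support of an itemset in DB."""
    return Fraction(sum(itemset <= t for t in DB), len(DB))


# AGC(D, eps) as listed in the example of Section 4.4.
AGC = [
    (frozenset(g), frozenset(f))
    for g, f in [("a", "abce"), ("ce", "abce"), ("d", "abcd"),
                 ("b", "be"), ("e", "be"), ("c", "bc")]
]


def support_bounds(itemset):
    """Return (lower, upper) bounds on sup(itemset) from a covering AGC-tuple."""
    for g, f in AGC:
        if g <= itemset <= f:
            lower = max((1 - EPS) * sup(g), sup(f))
            upper = min(sup(g), sup(f) / (1 - EPS))
            return lower, upper
    return None  # no covering tuple (should not happen for frequent itemsets)


items = sorted(set().union(*DB))
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if sup(frozenset(c)) >= MINSUP
]

lower, upper = support_bounds(frozenset("ab"))
print(lower, sup(frozenset("ab")), upper)  # 1/6 1/3 1/2
```

For the itemset $ab$ the covering tuple is $(a, abce)$, so the true support
$2/6$ is only known to lie between $1/6$ and $1/2$; a smaller $\varepsilon$ would
tighten this interval at the price of more AGC-tuples.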



4.3 Approximate Informative Basis of Association Rules 

Based on the set of AGC-tuples, $AGC(\mathcal{D}, \varepsilon)$, we can construct a basis of
association rules, called an approximate informative basis (AIB), from which any
significant rule can be easily reconstructed with its approximate support and
confidence. Before giving the formal definition, we introduce a notion of
approximate source of association rules.

Definition 4. (Approximate Sources of Association Rules)

Let $\mathcal{D}$ be a transaction database, $\varepsilon$ a user-defined parameter ($0 \le \varepsilon < 1$) and
$\mathcal{F}$ the set of frequent closed itemsets. Assume that $\{F_1, \ldots, F_k\}$ is the
partition of $\mathcal{F}$ based on which $AGC(\mathcal{D}, \varepsilon)$ is constructed.

For an EGC-tuple $(g, f) \in EGC(\mathcal{D})$, consider an $F_i$ such that $f \subseteq f_i^*$. An
association rule to which the pair of $(g, f)$ and $AGC(F_i)$ is attached,

$$s = g \rightarrow (f_i^* \setminus g) \,:\, \langle (g, f),\ AGC(F_i) \rangle,$$

is called an approximate source (A-source) of association rules⁶. The set of
A-sources is referred to as $AS(\mathcal{D}, \varepsilon)$. ■

We can reconstruct a set of association rules from an A-source.

Definition 5. (Reconstruction of Association Rules from A-source)

Let $s = g \rightarrow (f^* \setminus g) : \langle (g, f),\ AGC(F) \rangle$ be an A-source. It is said that an
association rule $l_1 \rightarrow (l_2 \setminus l_1)$ can be reconstructed from $s$ if
$g \subseteq l_1 \subseteq f$ and, for an AGC-tuple $(g^*, f^*) \in AGC(F)$, $g^* \subseteq l_2 \subseteq f^*$. ■

As shown in the next proposition, for any association rule that is
reconstructed from an A-source, its support and confidence lie within certain
ranges determined by the values of the source and $\varepsilon$.

⁶ In what follows, depending on contexts, $s$ often denotes only the rule
$g \rightarrow (f^* \setminus g)$ of $s$.

Proposition 3.

Let $s$ be an A-source and $r$ be an association rule reconstructed from $s$. Then

$$\frac{sup(s)}{1 - \varepsilon} \ge sup(r) \ge sup(s)
\quad\text{and}\quad
\frac{conf(s)}{1 - \varepsilon} \ge conf(r) \ge conf(s)$$

hold.

Proof.

Let $s = g \rightarrow (f^* \setminus g) : \langle (g, f),\ AGC(F) \rangle$ be an A-source and
$r = l_1 \rightarrow (l_2 \setminus l_1)$ be an association rule reconstructed from $s$. From the
definition of reconstruction, $g \subseteq l_1 \subseteq f$ and, for an AGC-tuple $(g^*, f^*)$
in $AGC(F)$, $g^* \subseteq l_2 \subseteq f^*$ hold. Note here that $sup(g) = sup(l_1) = sup(f)$.
Furthermore, from Proposition 1, $sup(g^*) \ge sup(l_2) \ge (1 - \varepsilon)\,sup(g^*)$ and
$sup(f^*) / (1 - \varepsilon) \ge sup(l_2) \ge sup(f^*)$ hold.

Since $sup(s) = sup(f^*)$ and $sup(r) = sup(l_2)$, we can immediately obtain
$sup(s) / (1 - \varepsilon) \ge sup(r) \ge sup(s)$.

Moreover, since $sup(g) = sup(l_1)$ and $sup(l_2) \ge sup(f^*)$,
$sup(l_2) / sup(l_1) \ge sup(f^*) / sup(g)$ holds. Similarly, from $sup(g) = sup(l_1)$
and $sup(f^*) / (1 - \varepsilon) \ge sup(l_2)$,
$sup(f^*) / \{(1 - \varepsilon)\,sup(g)\} \ge sup(l_2) / sup(l_1)$ holds. Therefore, we obtain
$sup(f^*) / \{(1 - \varepsilon)\,sup(g)\} \ge sup(l_2) / sup(l_1) \ge sup(f^*) / sup(g)$, that is,
$conf(s) / (1 - \varepsilon) \ge conf(r) \ge conf(s)$. □

The proposition states that if we can accept the errors, then $sup(s)$ and
$conf(s)$ can be viewed as approximations of $sup(r)$ and $conf(r)$, respectively.
That is, a set of association rules can be easily reconstructed from an A-source
with their approximate supports and confidences. In this sense, we can consider
these rules to be approximately redundant (A-redundant).

Now we can define an approximate informative basis of association rules from
which any significant rule can be reconstructed with its approximate values of
support and confidence.

Definition 6. (Approximate Informative Basis of Association Rules)

Let $\mathcal{D}$ be a transaction database and $\varepsilon$ be a user-defined parameter
($0 \le \varepsilon < 1$). An approximate informative basis of the significant association
rules w.r.t. $\mathcal{D}$ and $\varepsilon$, denoted by $AIB(\mathcal{D}, \varepsilon)$, is defined as the set of
A-sources whose confidences are not less than $(1 - \varepsilon)\,minconf$:

$$AIB(\mathcal{D}, \varepsilon) = \{\, s \mid s \in AS(\mathcal{D}, \varepsilon) \wedge conf(s) \ge (1 - \varepsilon)\,minconf \,\}.$$



Theorem 1.

Weak-Soundness of $AIB(\mathcal{D}, \varepsilon)$:
Any association rule $r$ reconstructed from $s$ in $AIB(\mathcal{D}, \varepsilon)$ is significant or at
worst A-significant⁷.

Completeness of $AIB(\mathcal{D}, \varepsilon)$:
For any significant association rule $r$, there exists an A-source $s$ in
$AIB(\mathcal{D}, \varepsilon)$ from which $r$ can be reconstructed.

⁷ For an association rule $r$, if $sup(r) \ge minsup$ and
$minconf > conf(r) \ge (1 - \varepsilon)\,minconf$, we say that $r$ is approximately
significant (A-significant).

Proof.

Weak-Soundness: Let $r = l_1 \rightarrow (l_2 \setminus l_1)$ be an association rule reconstructed
from an A-source $s = g \rightarrow (f^* \setminus g) : \langle (g, f),\ AGC(F) \rangle$ in $AIB(\mathcal{D}, \varepsilon)$. Then,
there exists an AGC-tuple $(g^*, f^*)$ in $AGC(F)$ such that $g^* \subseteq l_2 \subseteq f^*$.

From Proposition 3, $sup(r) \ge sup(s)$ and $conf(r) \ge conf(s)$. Since $f^*$ is a
frequent closed itemset, $sup(f^*) \ge minsup$. From $sup(s) = sup(f^*)$, therefore,
we have $sup(r) \ge minsup$. Furthermore, since $conf(s) \ge (1 - \varepsilon)\,minconf$, we
immediately have $conf(r) \ge (1 - \varepsilon)\,minconf$. Therefore, $r$ is at worst
A-significant.

Completeness: Let $r = l_1 \rightarrow (l_2 \setminus l_1)$ be a significant association rule. For each
$l_i$ ($i = 1, 2$), there exists an EGC-tuple $(g_i, f_i)$ in $EGC(\mathcal{D})$ such that
$g_i \subseteq l_i \subseteq f_i$. It should be noted here that since $l_1 \subset l_2$,
$f_1 \subseteq f_2$ holds. Assume that $AGC(\mathcal{D}, \varepsilon)$ is constructed based on a
partition of $\mathcal{F}$, $\mathcal{P}_{\mathcal{F}}$. For the EGC-tuple $(g_2, f_2)$, we can consider a cell
$F$ of $\mathcal{P}_{\mathcal{F}}$ such that $f_2 \subseteq f^*$, where $f^*$ is the maximum itemset in
$F$. Therefore, there exists an AGC-tuple $(g^*, f^*)$ in $AGC(F)$ such that
$g^* \subseteq g_2 \subseteq f_2 \subseteq f^*$. Furthermore, $f_1 \subseteq f^*$ holds. Therefore,
$s = g_1 \rightarrow (f^* \setminus g_1) : \langle (g_1, f_1),\ AGC(F) \rangle$ is an A-source from which $r$
can be reconstructed.

Since $r$ is a significant rule, $sup(l_2) / sup(l_1) \ge minconf$ holds. By multiplying
both sides by $(1 - \varepsilon)$, we obtain
$(1 - \varepsilon)\,sup(l_2) / sup(l_1) \ge (1 - \varepsilon)\,minconf$. From Proposition 2,
$sup(f^*) / (1 - \varepsilon) \ge sup(l_2)$ holds, that is, $sup(f^*) \ge (1 - \varepsilon)\,sup(l_2)$.
Therefore, we have $sup(f^*) / sup(l_1) \ge (1 - \varepsilon)\,minconf$. Since
$sup(l_1) = sup(g_1)$ and $sup(f^*) / sup(g_1) = conf(s)$, $AIB(\mathcal{D}, \varepsilon)$ contains
the A-source $s$. □

From Theorem 1, it is ensured that once we have $AIB(\mathcal{D}, \varepsilon)$, no significant
rule can be lost.
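
Definitions 4 and 6 can be turned into a short sketch that enumerates A-sources
and filters them by confidence. The EGC-tuples, their supports and the maximum
closures of the four cells are copied from the example in Section 4.4; the
function names are hypothetical. One assumption is ours: a source whose
consequent $f^* \setminus g$ is empty is kept when counting A-sources but dropped
from the basis, since a rule needs a non-empty consequent.

```python
from fractions import Fraction


def sixths(n):
    """Support value n/6 for the six-transaction example database."""
    return Fraction(n, 6)


# EGC-tuples (g, f) with sup(g) = sup(f), as listed in Section 4.4.
EGC = {
    ("a", "ac"): sixths(3), ("b", "b"): sixths(5), ("c", "c"): sixths(5),
    ("d", "acd"): sixths(2), ("e", "be"): sixths(4), ("ab", "abc"): sixths(2),
    ("ae", "abce"): sixths(1), ("bc", "bc"): sixths(4),
    ("bd", "abcd"): sixths(1), ("ce", "bce"): sixths(3),
}
# Maximum closure f* of each cell F1..F4 with its support.
CELL_MAX = {"abce": sixths(1), "abcd": sixths(1), "be": sixths(4), "bc": sixths(4)}


def a_sources():
    """Yield (g, f, f_star, conf) for every EGC-tuple and cell with f <= f_star."""
    for (g, f), sup_g in EGC.items():
        for f_star, sup_star in CELL_MAX.items():
            if set(f) <= set(f_star):
                yield g, f, f_star, sup_star / sup_g  # conf(s) = sup(f*) / sup(g)


def approximate_informative_basis(eps, minconf):
    """A-sources with conf(s) >= (1 - eps) * minconf and a non-empty consequent."""
    threshold = (1 - eps) * minconf
    return [
        (g, f_star)
        for g, _, f_star, conf in a_sources()
        if conf >= threshold and set(g) != set(f_star)
    ]


sources = list(a_sources())
aib = approximate_informative_basis(Fraction(7, 10), Fraction(85, 100))
print(len(sources), len(aib))  # 20 12
```

With $\varepsilon = 0.7$ and $minconf = 0.85$ the threshold is $0.255$, so for instance
$e \rightarrow (abce \setminus e)$ with confidence $1/4$ is rejected while
$e \rightarrow (be \setminus e)$ with confidence $1$ is kept.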



4.4 Constructing Approximate Informative Basis 

Given a transaction database $\mathcal{D}$, $minsup$, $minconf$ and a user-defined parameter
$\varepsilon$, we can construct an approximate informative basis w.r.t. $\mathcal{D}$ and $\varepsilon$,
$AIB(\mathcal{D}, \varepsilon)$. The construction process is divided into three sub-tasks:

1. Computing the set of EGC-tuples, $EGC(\mathcal{D})$.

2. Computing an approximation of $EGC(\mathcal{D})$, $AGC(\mathcal{D}, \varepsilon)$.

3. Constructing an approximate informative basis, $AIB(\mathcal{D}, \varepsilon)$.

The first task can be performed by adopting a Close [3]-like algorithm and the last
one is straightforward. An algorithm for the second task, computing $AGC(\mathcal{D}, \varepsilon)$
from $EGC(\mathcal{D})$, is shown in Figure 1. In general, as $\varepsilon$ becomes larger, the
number of iterations of the while-loops decreases. The worst-case complexity of the
algorithm is $O(N^2)$, where $N$ is the size of $EGC(\mathcal{D})$ (that is, the number of
EGC-tuples in $EGC(\mathcal{D})$).

    Input  : EGC(D) and ε.
    Output : AGC(D, ε).

    AGC(D, ε) ← ∅;
    EG ← ∅; Rem ← ∅; Min ← ∅;
    while EGC(D) ≠ ∅ do
        pick up t = (g, f) from EGC(D);
        while EGC(D) ≠ ∅ do
            remove t' = (g', f') from EGC(D);
            if f' ⊆ f ∧ sup(f)/sup(f') ≥ 1 − ε
                then EG ← EG ∪ {g'};
                else Rem ← Rem ∪ {t'};
            end
        end
        Min ← the set of minimal elements of EG;
        for g ∈ Min do
            AGC(D, ε) ← AGC(D, ε) ∪ {(g, f)};
        end
        EGC(D) ← Rem;
        EG ← ∅; Rem ← ∅; Min ← ∅;
    end
    Output AGC(D, ε)

Fig. 1. Algorithm for Constructing AGC(D, ε)


Example:

For the transaction database $\mathcal{D}$ shown in Figure 2, we try to construct an
approximate informative basis. In the database, each itemset is represented in a
simple form. For example, an itemset $\{a, b, c\}$ is denoted as $abc$. We assume here
that $minsup = 1/6$ and $\varepsilon = 0.7$.

    ID  itemset
    1   acd
    2   bce
    3   abce
    4   be
    5   abcd
    6   bce

Fig. 2. Example of Transaction Database

At first, the set of EGC-tuples, $EGC(\mathcal{D})$, is computed. For the database, we
can obtain the following 10 EGC-tuples:

$EGC(\mathcal{D}) = \{\, (a, ac) : 3/6,\ (b, b) : 5/6,\ (c, c) : 5/6,\ (d, acd) : 2/6,\ (e, be) : 4/6,$
$\qquad (ab, abc) : 2/6,\ (ae, abce) : 1/6,\ (bc, bc) : 4/6,$
$\qquad (bd, abcd) : 1/6,\ (ce, bce) : 3/6 \,\}$,

where the value attached to each tuple is the support of the tuple.

Then an $AGC(\mathcal{D}, \varepsilon)$ is constructed from $EGC(\mathcal{D})$ according to the algorithm
in Figure 1. For example, we have

$AGC(\mathcal{D}, \varepsilon) = \{\, (a, abce),\ (ce, abce),\ (d, abcd),\ (b, be),\ (e, be),\ (c, bc) \,\}$.

It should be noted here that the set of frequent closed itemsets,

$\mathcal{F} = \{\, abce,\ abcd,\ abc,\ bce,\ acd,\ ac,\ bc,\ be,\ b,\ c \,\}$,

is divided into the following 4 cells:

$F_1 = \{\, abce,\ abc,\ bce,\ ac \,\}$,
$F_2 = \{\, abcd,\ acd \,\}$,
$F_3 = \{\, be,\ b \,\}$ and
$F_4 = \{\, bc,\ c \,\}$.

That is,

$AGC(F_1) = \{\, (a, abce),\ (ce, abce) \,\}$,
$AGC(F_2) = \{\, (d, abcd) \,\}$,
$AGC(F_3) = \{\, (b, be),\ (e, be) \,\}$ and
$AGC(F_4) = \{\, (c, bc) \,\}$.
Based on AGC(D, ε), we can obtain the set of A-sources, AS(D, ε), consisting
of 20 sources. Assuming minconf = 0.85, we have the following approximate
informative basis consisting of 12 sources:

AIB(D, ε) = { s1  = a  ⇒ (abce \ a)  : ((a, ac), AGC(F1)),
              s2  = a  ⇒ (abcd \ a)  : ((a, ac), AGC(F2)),
              s3  = b  ⇒ (be \ b)    : ((b, b), AGC(F3)),
              s4  = b  ⇒ (bc \ b)    : ((b, b), AGC(F4)),
              s5  = c  ⇒ (bc \ c)    : ((c, c), AGC(F4)),
              s6  = d  ⇒ (abcd \ d)  : ((d, acd), AGC(F2)),
              s7  = e  ⇒ (be \ e)    : ((e, be), AGC(F3)),
              s8  = ab ⇒ (abce \ ab) : ((ab, abc), AGC(F1)),
              s9  = ab ⇒ (abcd \ ab) : ((ab, abc), AGC(F2)),
              s10 = ae ⇒ (abce \ ae) : ((ae, abce), AGC(F1)),
              s11 = bd ⇒ (abcd \ bd) : ((bd, abcd), AGC(F2)),
              s12 = ce ⇒ (abce \ ce) : ((ce, bce), AGC(F1)) }.
Table 1. Experimental Results

minsup = 0.1            minconf = 0.7   minconf = 0.5   minconf = 0.3
Close                           5,134           9,290          15,048
Our System (ε = 0.1)            1,733           2,985           4,444
Our System (ε = 0.2)            1,196           1,793           2,502

minsup = 0.05           minconf = 0.7   minconf = 0.5   minconf = 0.3
Close                           7,742          15,594          28,712
Our System (ε = 0.1)            3,203           5,817           9,822
Our System (ε = 0.2)            2,194           3,600           5,500

minsup = 0.01           minconf = 0.7   minconf = 0.5   minconf = 0.3
Close                          11,997          28,458          59,153
Our System (ε = 0.1)            6,900          13,290          25,113
Our System (ε = 0.2)            3,824           6,357          10,432

For example, from the A-source s1, an association rule r = a ⇒ (ac \ a) can
be reconstructed with its approximate support and confidence, 1/6 (= sup(s1))
and 1/3 (= conf(s1)). On the other hand, its precise support and confidence
are 1/6 and 1, respectively. We can easily verify that the error of the
confidence is consistent with Proposition 3.



5 Experimental Results 



In this section, we present our preliminary experimental results. 

In order to verify the effectiveness of our method, we have implemented a
system to compute an AIB based on the algorithms presented in the previous
section. The algorithm Close has been implemented as well, for comparison with
the original method by Bastide et al. Our system and Close have been written in
C and have been tested on a 400 MHz Pentium II PC with 160 MB of memory.

For our experimentation, we have obtained the “1984 United States Congressional
Voting Records Database” from the UCI Repository [7]. It consists of 435
transactions, and the number of possible items is 17. Our system has computed
AIBs for the database in various settings of the parameters minsup, minconf
and ε.

The numbers of rules output by each system are summarized in Table 1,
where the results obtained by the original method are referred to as Close.

For each parameter setting, our system has output fewer rules than Close.
In the most effective case, about 70% reduction has been achieved compared to
Close. Even in the worst case, about 43% reduction has been achieved. Therefore,
we can consider that our method is very effective in reducing the number of
generated rules.

6 Concluding Remarks 

In this paper, we have presented a method for constructing an approximate
informative basis (AIB) for significant association rules, from which any
significant rule can easily be reconstructed with its approximate support and
confidence. The maximum errors of these values are precisely bounded by formulae
determined by a user-defined parameter ε. Therefore, we can flexibly adjust the
precision of these approximate values. Some experimental results have shown
that our method can drastically reduce the number of rules to be generated
compared to the original framework. Therefore, the readability and
understandability of the rules would be improved by providing an adequate value
of ε.

As a next step of this study, we are planning to formalize a method for
identifying actually interesting rules with their support and confidence in an
interactive manner. In the initial stage, ε is given a value close to 1 by a
user, and we obtain a rough AIB whose contents we can easily and completely
check. By checking them, the user selects several A-sources from which some
interesting rules seem to be reconstructible. Then the user decreases the value
of ε to obtain a more precise AIB. It should be noted that the system presents
only the part of the AIB which is related to the A-sources previously selected
by the user. Therefore, we can obtain a more precise AIB while keeping the
number of contents small. For the presented AIB, similar processes are
iteratively performed until the user satisfactorily identifies interesting rules
with their support and confidence. At each stage, since the system keeps the
presented AIB compact, the selection tasks by the user would not be costly.
Therefore, such a system would be quite helpful for users who try to discover
interesting rules easily.

In order to construct such an interactive system, we expect that the
efficiency of computing an AIB has to be improved further. Our AIB is currently
computed by adopting an extended algorithm of Close [3]. Although Close can
efficiently identify the set of frequent closed itemsets, several new algorithms
for the same task have been proposed recently, e.g., A-Close [4], CHARM [6] and
CLOSET [5]. By adopting these algorithms, the efficiency of computing an AIB
would be improved.
Abstract. Document retrieval can be considered as a basic but important
tool for text mining that is capable of taking a user’s information
need into account. However, document retrieval is a hard task if multi- 
topic lengthy documents have to be retrieved with a very short descrip- 
tion (a few keywords) of the information need. In this paper, we focus 
on this problem which is typical in real world applications. We experi- 
mentally validate that passage-based document retrieval is advantageous 
in such circumstances as compared to conventional document retrieval. 
Passage-based document retrieval is a kind of document retrieval which 
takes into account only small fractions (passages) of documents to judge 
the document relevance to the information need. As a passage-based 
method, we employ the method based on density distributions of key- 
words. This is compared with the following three conventional methods 
for document retrieval: the vector space model, pseudo-feedback, and 
latent semantic indexing. Experimental results show that the passage- 
based method is superior to the conventional methods if long documents 
have to be retrieved by short queries. 



1 Introduction 

The growing number of electronic textual documents has created the need of 
intelligent access to the information implied by them. The goal of text mining is 
to discover novel nuggets of information from a huge collection of documents to 
fulfill the need in the ultimate sense [1] . The unstructured nature of documents, 
however, makes it difficult to realize the goal in a general way. The current state- 
of-the-art is to approach the goal by integrating the tools developed so far in 
other related research areas [2], though their functionality and/or domains of 
interest are still restricted. In order to take a step forward, it would be required 
both to devise a novel combination of the tools and to polish them up. 

A typical scenario of text mining would be that (1) information extraction
is utilized to obtain the information from documents, and (2) data mining is
applied to the extracted information to derive novel information. In this
scenario, information of interest is fixed in the first stage of the processing.
Another possibility is mining based on a user’s ad-hoc information need. In this
scenario, document retrieval is a tool applied at the first stage of processing
to select documents analyzed at later stages. Although the research area of
document retrieval has several decades of history, it is still not trivial to
retrieve documents relevant to a user’s need. Two major problems uncovered
through the research activities are as follows:

Multi-topic documents. If a document is beyond the length of abstracts, it 
often contains several topics. Even though one of them is relevant to the 
user’s need, the rest are not necessarily relevant. As a result, these irrelevant 
parts severely disturb the retrieval of documents. 

Short queries. It is common that a user’s information need is fed to a system as 
a set of query terms. However, it is not an easy task for a user to transform the 
need into query terms. From the analysis of Web search logs, for example, it 
is well-known that typical users issue quite short queries consisting of several 
terms. Such queries are too poor to retrieve documents appropriately. 

In conventional document retrieval, the retrieval of multi-topic documents 
is a hard task since there is no way to avoid the influence of irrelevant parts 
of documents. In order to tackle this problem, some researchers have proposed 
a different way of retrieval called passage-based document retrieval [3,4,5]. In 
passage-based document retrieval, documents are retrieved based only on frac- 
tions (passages) of documents in order not to be disturbed by the irrelevant 
parts. It has been shown in the literature that passage-based document retrieval 
outperforms conventional document retrieval in processing long documents [5]. 
To handle passages as units of retrieval is advantageous to the application to text 
mining since it also gives a clue to extract relevant parts from the documents. 

In this paper, we experimentally validate that, for the second (short queries)
problem, passage-based document retrieval is also superior to conventional
document retrieval. As a method of passage-based retrieval, we utilize a method
based on “density distributions” [6]. This method segments documents into
passages dynamically in response to a query. As conventional methods, we employ
the following three [7,8]: the vector space model, pseudo-feedback and latent
semantic indexing.

2 Conventional Document Retrieval

Let us begin with an overview of conventional document retrieval methods. The 
task of document retrieval is to retrieve documents relevant to a given query from 
a fixed set of documents or a document database. In a common way to deal with 
documents as well as queries, they are represented using a set of index terms 
(simply called terms from now on) by ignoring their positions in documents and 
queries. Terms are determined based on words of documents in the database. 
In the following, $t_i$ ($1 \le i \le m$) and $d_j$ ($1 \le j \le n$) represent
a term and a document in the database, respectively, where $m$ is the number of
terms and $n$ is the number of documents.

2.1 Vector Space Model 

The vector space model (VSM) [7,8] is the simplest retrieval model. In the VSM,
a document $d_j$ is represented as an $m$-dimensional vector:

$$ \mathbf{d}_j = (w_{1j}, \ldots, w_{mj})^{T} , \qquad (1) $$

where $T$ indicates the transpose and $w_{ij}$ is the weight of a term $t_i$ in
a document $d_j$. A query $q$ is likewise represented as

$$ \mathbf{q} = (w_{1q}, \ldots, w_{mq})^{T} , \qquad (2) $$

where $w_{iq}$ is the weight of a term $t_i$ in a query $q$.

So far, a variety of schemes for computing weights have been proposed. In
this paper, we employ a standard scheme called “tf-idf”, defined as follows:

$$ w_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i , \qquad (3) $$

where $\mathrm{tf}_{ij}$ is the weight calculated using the term frequency
$f_{ij}$ (the number of occurrences of a term $t_i$ in a document $d_j$), and
$\mathrm{idf}_i$ is the weight calculated using the inverse of the document
frequency $n_i$ (the number of documents which contain a term $t_i$). In
computing $\mathrm{tf}_{ij}$ and $\mathrm{idf}_i$, the raw frequency is usually
dampened by a function. We utilize $\mathrm{tf}_{ij} = \sqrt{f_{ij}}$ and
$\mathrm{idf}_i = \log(n/n_i)$, where $n$ is the total number of documents. The
weight $w_{iq}$ is similarly defined as $w_{iq} = \sqrt{f_{iq}}$, where $f_{iq}$
is the frequency of a term $t_i$ in a query $q$.

The result of retrieval is represented as a list of documents ranked according
to their similarity to the query. The similarity $\mathrm{sim}(\mathbf{d}_j,
\mathbf{q})$ between a document $d_j$ and a query $q$ is measured by the cosine
of the angle between $\mathbf{d}_j$ and $\mathbf{q}$:

$$ \mathrm{sim}(\mathbf{d}_j, \mathbf{q}) = \frac{\mathbf{d}_j^{T} \mathbf{q}}{\|\mathbf{d}_j\| \, \|\mathbf{q}\|} , \qquad (4) $$

where $\|\cdot\|$ is the Euclidean norm of a vector.
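
The weighting and ranking steps of the VSM can be illustrated with a short
script. This is a minimal sketch under stated assumptions, not the system used
in the experiments: documents and queries are assumed to be already tokenized
lists of terms, vectors are stored sparsely as dictionaries, and the function
names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return sparse tf-idf vectors with tf = sqrt(f_ij), idf = log(n / n_i)."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / doc_freq[term]) for term in doc_freq}
    vectors = [{t: math.sqrt(f) * idf[t] for t, f in Counter(doc).items()}
               for doc in docs]
    return vectors, idf


def query_vector(query):
    """Query weights w_iq = sqrt(f_iq), as in the text (no idf factor)."""
    return {t: math.sqrt(f) for t, f in Counter(query).items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if one is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy collection (hypothetical data).
docs = [["passage", "retrieval", "passage"],
        ["vector", "space", "model"],
        ["retrieval", "model"]]
vectors, idf = tfidf_vectors(docs)
q = query_vector(["passage", "retrieval"])
sims = [cosine(d, q) for d in vectors]
ranking = sorted(range(len(docs)), key=lambda j: -sims[j])
print(ranking, sims)
```

Note that a term occurring in every document receives the weight log(1) = 0 and
therefore does not contribute to the similarity.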

2.2 Pseudo-Feedback 

A problem of the VSM is that a query is often too short to rank documents 
appropriately. To cope with this problem, it has been proposed to enrich an 
original query by expanding it with terms in documents. 

A method called “pseudo-feedback” [8] is known as a way to obtain the terms 
for expansion. In this method, first, documents are ranked with an original query. 
Then, highly ranked documents are assumed to be relevant and their terms are 
incorporated into the original query. Documents are ranked again by using the 
expanded query. 
In this paper, we employ a simple variant of pseudo-feedback. Let $E$ be a set
of document vectors for expansion given by

$$ E = \left\{ \mathbf{d}_j \;\middle|\; \frac{\mathrm{sim}(\mathbf{d}_j, \mathbf{q})}{\max_i \mathrm{sim}(\mathbf{d}_i, \mathbf{q})} \ge \tau \right\} , \qquad (5) $$

where $\mathbf{q}$ is an original query vector and $\tau$ is a threshold of the
similarity. The sum $\mathbf{d}_s$ of document vectors in $E$,

$$ \mathbf{d}_s = \sum_{\mathbf{d}_j \in E} \mathbf{d}_j , \qquad (6) $$

can be considered as enriched information about the original query. Then, the
expanded query vector $\mathbf{q}'$ is obtained by

$$ \mathbf{q}' = \frac{\mathbf{q}}{\|\mathbf{q}\|} + \lambda \frac{\mathbf{d}_s}{\|\mathbf{d}_s\|} , \qquad (7) $$

where $\lambda$ is a parameter for controlling the weight of the newly
incorporated component. Finally, documents are ranked again according to the
similarity $\mathrm{sim}(\mathbf{d}_j, \mathbf{q}')$ to the expanded query.
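
The expansion step can be sketched as follows. The sketch assumes sparse
vectors stored as dictionaries and reads the selection rule as “relative
similarity of at least τ”; the function names and the toy vectors are
hypothetical and only serve to show how terms of highly ranked documents enter
the query.

```python
import math


def norm(v):
    return math.sqrt(sum(w * w for w in v.values()))


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0


def expand_query(q, doc_vectors, tau, lam):
    """Return q' = q/||q|| + lam * d_s/||d_s||, with d_s summed over E."""
    sims = [cosine(d, q) for d in doc_vectors]
    best = max(sims)
    if best == 0.0:
        return dict(q)  # nothing to expand with
    # E: documents whose similarity relative to the best one reaches tau.
    selected = [d for d, s in zip(doc_vectors, sims) if s / best >= tau]
    d_s = {}
    for d in selected:
        for t, w in d.items():
            d_s[t] = d_s.get(t, 0.0) + w
    nq, ns = norm(q), norm(d_s)
    return {t: q.get(t, 0.0) / nq + lam * d_s.get(t, 0.0) / ns
            for t in set(q) | set(d_s)}


# Toy vectors (hypothetical weights).
doc_vectors = [{"passage": 2.0, "retrieval": 1.0, "window": 1.0},
               {"passage": 1.0, "model": 3.0},
               {"vector": 1.0}]
q = {"passage": 1.0}
expanded = expand_query(q, doc_vectors, tau=0.71, lam=1.0)
print(expanded)
```

Here only the first document passes the threshold, so the one-term query gains
the terms “retrieval” and “window”, while “model” from the weakly similar second
document is not added.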



2.3 Latent Semantic Indexing 

Latent semantic indexing (LSI) [7,8] is another well-known way to improve the
VSM. Let $D$ be a term-by-document matrix defined by

$$ D = (\hat{\mathbf{d}}_1, \ldots, \hat{\mathbf{d}}_n) , \qquad (8) $$

where $\hat{\mathbf{d}}_j = \mathbf{d}_j / \|\mathbf{d}_j\|$. By applying the
singular value decomposition, $D$ is decomposed into the product of three
matrices:

$$ D = U S V^{T} , \qquad (9) $$

where $U$ and $V$ are matrices of size $m \times r$ and $n \times r$
($r = \mathrm{rank}(D)$), respectively, and $S = \mathrm{diag}(\sigma_1,
\ldots, \sigma_r)$ is a diagonal matrix with singular values $\sigma_i$
($\sigma_i \ge \sigma_j$ if $i < j$). Each row vector in $U$ ($V$) corresponds
to an $r$-dimensional vector representing a term (document).

By keeping only the $k$ ($< r$) largest singular values in $S$ along with the
corresponding columns in $U$ and $V$, $D$ is approximated by

$$ D_k = U_k S_k V_k^{T} , \qquad (10) $$

where $U_k$, $S_k$ and $V_k$ are matrices of size $m \times k$, $k \times k$ and
$n \times k$, respectively. This approximation allows us to uncover “latent”
semantic relations among terms as well as documents.

The similarity between a document and a query is measured as follows. Let
$\mathbf{v}_j = (v_{j1}, \ldots, v_{jk})$ be a row vector in $V_k = (v_{ji})$
($1 \le j \le n$, $1 \le i \le k$). In the $k$-dimensional (approximated) space,
a document $d_j$ is represented as

$$ \mathbf{d}_j^{*} = S_k \mathbf{v}_j^{T} . \qquad (11) $$

An original query is also represented in the $k$-dimensional space as

$$ \mathbf{q}^{*} = U_k^{T} \mathbf{q} . \qquad (12) $$

Then the similarity is obtained by $\mathrm{sim}(\mathbf{d}_j^{*},
\mathbf{q}^{*})$.
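
A short check may help to see why the document and query representations are
comparable. It relies only on the standard fact that the columns of $U_k$ are
orthonormal ($U_k^{T} U_k = I$); the notation $(D_k)_{\cdot j}$ for the $j$-th
column of $D_k$ is introduced here for illustration.

```latex
U_k^{T} D_k = U_k^{T} U_k S_k V_k^{T} = S_k V_k^{T}
\quad\Longrightarrow\quad
U_k^{T} (D_k)_{\cdot j} = S_k \mathbf{v}_j^{T} = \mathbf{d}_j^{*} ,
\qquad
\mathbf{q}^{*} = U_k^{T} \mathbf{q} .
```

In other words, the approximated document vectors and the query are mapped into
the $k$-dimensional space by the same projection $U_k^{T}$.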



3 Passage-Based Document Retrieval 



Passages used in passage-based methods can be classified into three types: dis- 
course, semantic and window [3]. Discourse passages are defined based on dis- 
course units such as sentences and paragraphs. Semantic passages are obtained 
by segmenting text at the points where the subject of text changes. Window 
passages are determined based on the number of terms. 

In this paper, we employ a passage-based method with window passages 
called “density distributions” (DD). The density distribution was first introduced 
to locate the descriptions of a word [9] and applied to passage retrieval by some 
of the authors [6]. 

The fundamental idea of DD is that parts of documents which densely contain 
the terms in a query are relevant to it. Figure 1 shows an example of a density 
distribution. The horizontal axis indicates the positions of terms in a document. 
The distribution of query terms in the document is shown as spikes in the figure: 
their height indicates the weight of a term. The density distribution shown in the 
figure is obtained by smoothing the spikes with a window function. The details 
are as follows. 

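Let $a_j(l)$ ($1 \le l \le L_j$) be a term at the position $l$ in a document
$d_j$, where $L_j$ is the length of a document $d_j$ measured in terms. The
weighted distribution $b_j(l)$ of terms in a query $q$ is defined by

$$ b_j(l) = \begin{cases} w_{iq} \cdot \mathrm{idf}_i & \text{if } a_j(l) = t_i \in q , \\ 0 & \text{otherwise} . \end{cases} \qquad (13) $$

Smoothing of $b_j(l)$ enables us to obtain the density distribution $dd_j(l)$
for a document $d_j$:

$$ dd_j(l) = \sum_{x=-W/2}^{W/2} f(x) \, b_j(l - x) , \qquad (14) $$

where $f(x)$ is a window function with a window size $W$. We employ the Hanning
window function defined by

$$ f(x) = \begin{cases} \frac{1}{2} \left( 1 + \cos 2\pi \frac{x}{W} \right) & \text{if } |x| \le W/2 , \\ 0 & \text{otherwise} , \end{cases} \qquad (15) $$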
whose shape is illustrated in Fig. 2.

Fig. 1. Density distribution. (Horizontal axis: term position.)

Fig. 2. Hanning window function.

In order to utilize DD as a passage-based document retrieval method, a score 
of a document is calculated using the density distribution. The score of dj for a 
query q is obtained as the maximum value of its density distribution as follows: 

$$ \mathrm{score}(d_j, q) = \max_{l} \, dd_j(l) . \qquad (16) $$

This score is used to rank documents according to a query. 
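
The scoring procedure can be sketched in a few lines. This is an illustration
under assumptions rather than the authors' implementation: positions are
0-based, positions outside the document contribute nothing to the sum, and the
function names, the toy document and the unit weights are hypothetical.

```python
import math


def hanning(x, W):
    """Hanning window of size W, centered at 0."""
    if abs(x) > W / 2:
        return 0.0
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * x / W))


def density_distribution(doc_terms, query_weights, idf, W):
    """Smooth the weighted positions of query terms with a Hanning window."""
    L = len(doc_terms)
    # b(l): w_iq * idf_i where a query term occurs, 0 elsewhere.
    b = [query_weights.get(t, 0.0) * idf.get(t, 0.0) for t in doc_terms]
    half = W // 2
    dd = []
    for l in range(L):
        total = 0.0
        for x in range(-half, half + 1):
            if 0 <= l - x < L:  # assumed boundary handling
                total += hanning(x, W) * b[l - x]
        dd.append(total)
    return dd


def dd_score(doc_terms, query_weights, idf, W):
    """Score of a document: the maximum of its density distribution."""
    dd = density_distribution(doc_terms, query_weights, idf, W)
    return max(dd) if dd else 0.0


# Toy document: "q" marks a query term, "a" any other term.
doc = ["a", "q", "q", "a", "a", "a", "a", "q"]
dd = density_distribution(doc, {"q": 1.0}, {"q": 1.0}, W=4)
score = dd_score(doc, {"q": 1.0}, {"q": 1.0}, W=4)
print(dd, score)
```

The two adjacent occurrences reinforce each other, so the maximum lies in the
dense part of the document rather than at the isolated occurrence at the end.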



4 Experimental Comparison 

In this section, we show the results of the experimental comparison. After the 
description of the test collections employed for the experiments, our methods 
for evaluating the results are described. Then, the results of experiments are 
presented and discussed. 
Table 1. Statistics about documents in the test collections.

                        MED      CRAN        CR        FR
size [MB]               1.1       1.6       235       209
no. of doc.           1,033     1,398    27,922    19,789
no. of terms†         4,284     2,550    37,769    43,760
doc. len.‡  min.         20        23        22         1
            max.        658       662   629,028   315,101
            mean        155       162     1,455     1,792
            median      139       142       324       550

† : counted in words after stemming and eliminating stopwords
‡ : counted in words before stemming and eliminating stopwords



Table 2. Statistics about queries in the test collections.

                                             CR                     FR
                       MED    CRAN    title  desc  narr     title  desc  narr
no. of queries          30     225            34                    85
query len.† min.         2       3        2     4    12         1     3    12
            max.        33      21        7    19    79         9    22    93
            mean      10.8     9.2      3.0   7.7  28.7       3.5  10.4  37.0
            median     9.0     9.0      3.0   6.5  24.5       3.0  10.0  34.0

† : counted in words after stemming and eliminating stopwords



4.1 Test Collections 

We made a comparison using four test collections: MED (medicine), CRAN
(aeronautics), FR (federal register) and CR (congressional record). The
collections MED and CRAN are available at [12], and FR and CR are contained in
the TREC disks No. 2 and No. 4, respectively [13]. All collections are provided
with queries and their ground truth (a list of documents relevant to each
query). For these collections, terms used for document representation were
obtained by stemming and eliminating stopwords (words which convey no meaning,
such as “the”).

Tables 1 and 2 show some statistics about the collections. In Table 1, an
important difference is the length of documents: MED and CRAN consist of
abstracts, while FR and CR contain much longer documents. In Table 2, a point to
note is the difference of query length. In the TREC collections, each
information need is described by query types of different length. In order to
investigate the influence of query length, we employed three types: “title” (the
shortest representation), “desc” (description; medium length) and “narr”
(narrative; the longest).
4.2 Evaluation

Average Precision. A common way to evaluate the performance of retrieval 
methods is to compute the (interpolated) precision at some recall levels. This 
results in a number of recall / precision points which are displayed in recall- 
precision graphs [7]. However, it is sometimes convenient for us to have a 
single value that summarizes the performance. The average precision (non- 
interpolated) over all relevant documents [7,12] is a measure resulting in a single 
value. The definition is as follows. 

As described in Sect. 2, the result of retrieval is represented as the ranked
list of documents. Let $r(i)$ be the rank of the $i$-th relevant document
counted from the top of the list. The precision for this document is calculated
by $i/r(i)$. The precision values for all documents relevant to a query are
averaged to obtain a single value for the query. The average precision over all
relevant documents is then obtained by averaging the respective values over all
queries.

For example, consider two queries $q_1$ and $q_2$ which have two and three
relevant documents, respectively. Suppose the ranks of relevant documents for
$q_1$ are 2 and 5, and those for $q_2$ are 1, 3 and 10. The average precision
for $q_1$ and $q_2$ is computed as $(1/2 + 2/5)/2 = 0.45$ and
$(1/1 + 2/3 + 3/10)/3 = 0.66$, respectively. Then the average precision over all
relevant documents which takes into account both queries is
$(0.45 + 0.66)/2 = 0.56$.
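
The worked example can be reproduced with a few lines of code. The function
names are hypothetical; the input for each query is simply the list of ranks at
which its relevant documents appear. Note that the text rounds the per-query
values before averaging them, whereas the sketch below averages the unrounded
values, so the last digit of the overall figure differs slightly.

```python
def average_precision(relevant_ranks):
    """Non-interpolated average precision for one query.

    relevant_ranks: ranks r(i) of the relevant documents (1 = top of the list).
    """
    ranks = sorted(relevant_ranks)
    return sum((i + 1) / r for i, r in enumerate(ranks)) / len(ranks)


def mean_average_precision(ranks_per_query):
    """Average of the per-query values over all queries."""
    values = [average_precision(r) for r in ranks_per_query]
    return sum(values) / len(values)


ap1 = average_precision([2, 5])        # (1/2 + 2/5) / 2
ap2 = average_precision([1, 3, 10])    # (1/1 + 2/3 + 3/10) / 3
overall = mean_average_precision([[2, 5], [1, 3, 10]])
print(round(ap1, 2), round(ap2, 2), round(overall, 3))
```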

Statistical Test. The next step for the evaluation is to compare the values of 
the average precision obtained by different methods. An important question here 
is whether the difference in the average precision is really meaningful or just by 
chance. In order to make such a distinction, it is necessary to apply a statistical 
test. 

Several statistical tests have been applied to the task of information
retrieval [10,11]. In this paper, we utilize the test called “macro t-test” [11]
(called paired t-test in [10]). The following is the summary of the test as
described in [10].

Let $a_i$ and $b_i$ be the scores (e.g., the average precision) of retrieval
methods A and B for a query $i$, and define $d_i = a_i - b_i$. The test can be
applied under the assumptions that the model is additive, i.e.,
$d_i = \mu + \varepsilon_i$, where $\mu$ is the population mean and
$\varepsilon_i$ is an error, and that the errors are normally distributed. The
null hypothesis here is $\mu = 0$ (A performs equivalently to B in terms of the
average precision), and the alternative hypothesis is $\mu > 0$ (A performs
better than B).

It is known that the Student’s t-statistic

$$ t = \frac{\bar{d}}{\sqrt{s^2 / n}} \qquad (17) $$

follows the t-distribution with $n - 1$ degrees of freedom, where $n$ is the
number of samples (queries), and $\bar{d}$ and $s^2$ are the sample mean and
variance:

$$ \bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i , \qquad (18) $$
Table 3. Values of parameters.

       parameter        MED, CRAN                CR, FR
PF     weight λ         1.0, 2.0                 1.0, 2.0
       threshold τ      0.71 – 0.99 step 0.02    0.71 – 0.99 step 0.02
LSI    dimension k      60 – 500 step 20         50 – 500 step 50
DD     window size W    20 – 200 step 20         20 – 100 step 10, and 150, 200, 300

Table 4. Best parameter values.

                                   CR                     FR
            MED    CRAN    title  desc  narr     title  desc  narr
PF    λ     2.0     1.0      1.0   1.0   1.0       1.0   2.0   1.0
      τ    0.71    0.85     0.85  0.85  0.93      0.83  0.71  0.71
LSI   k      60     260      300   500   400       350   500   500
DD    W      80     100       50    90   200        90    40    40

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2 . \qquad (19) $$

By looking up the value of $t$ in the t-distribution, we can obtain the
P-value, i.e., the probability of observing the sample results $d_i$
($1 \le i \le n$) under the assumption that the null hypothesis is true. The
P-value is compared to a predetermined significance level $\alpha$ in order to
decide whether the null hypothesis should be rejected or not. As significance
levels, we utilize 0.05 and 0.01.
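
The computation of the statistic can be sketched as follows. The per-query
scores below are hypothetical values chosen for illustration, and the function
name is an assumption. The sketch stops at the t-statistic and the degrees of
freedom; the P-value would then be read from a table of the t-distribution or
obtained from a statistics library.

```python
import math


def paired_t_statistic(a, b):
    """t-statistic and degrees of freedom for per-query scores of A and B."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n                                      # sample mean
    variance = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    t = mean / math.sqrt(variance / n)
    return t, n - 1


# Hypothetical average precision values of two methods on four queries.
scores_a = [0.30, 0.25, 0.40, 0.35]
scores_b = [0.20, 0.20, 0.30, 0.30]
t, dof = paired_t_statistic(scores_a, scores_b)
print(round(t, 3), dof)
```

A large positive value of `t` speaks for the alternative hypothesis that method
A performs better than method B; if all differences are identical, the variance
is zero and the statistic is undefined.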

4.3 Results for the Whole Collections 

The methods PF (pseudo-feedback), LSI (latent semantic indexing) and DD
(density distributions) were applied by ranging the values of parameters as
shown in Table 3. Figure 3 illustrates, as an example, the variation in the
average precision when varying the threshold $\tau$ in PF ($\lambda = 1.0$;
left) and the window size $W$ in DD (right). The lines in the graphs were
obtained from the experiments on the collections CR and FR. Since these
collections have three query sets (title, desc, narr), six lines are shown in
each graph. In the graph of PF, the average precision fluctuated slowly but
irregularly with the threshold $\tau$. On the other hand, the average precision
of DD partly changed rapidly for smaller window sizes, and showed a tendency to
converge as the window size became larger. Since better performance of DD was
often obtained with smaller window sizes, DD would be more sensitive to the
parameter $W$ than PF is to $\tau$. Although it is an important topic to develop
a method of automated adjustment of the window size, it is beyond the scope of
this paper; we simply selected the best values of parameters, which are shown in
Table 4.

Table 5 shows the average precision obtained by using the best parameter 
values. In Table 5, the best and the second best values of average precision 




164 



K. Kise et al. 



[Figure with two panels. Left: “PF (λ = 1.0)”, horizontal axis: threshold τ.
Right: “DD”, horizontal axis: window size W. Vertical axes: average precision.
Lines: CR / title, CR / desc, CR / narr, FR / title, FR / desc, FR / narr.]

Fig. 3. Variations in the average precision.



Table 5. Average precision over all relevant documents.

                                         CR                            FR
         MED       CRAN      title     desc      narr      title     desc      narr
VSM      0.530     0.401     0.127     0.172     0.172     0.098     0.094     0.120
PF       0.640     0.450     0.169     0.195     0.184     0.115     0.123     0.119
         (+20.8%)  (+12.2%)  (+33.1%)  (+13.4%)  (+7.0%)   (+17.3%)  (+30.9%)  (-0.8%)
LSI      0.685     0.444     0.101     0.128     0.134     0.043     0.051     0.075
         (+29.2%)  (+10.7%)  (-20.5%)  (-25.6%)  (-22.1%)  (-56.1%)  (-45.7%)  (-37.5%)
DD       0.507     0.370     0.165     0.159     0.151     0.177     0.207     0.237
         (-4.3%)   (-7.7%)   (+29.9%)  (-7.6%)   (-12.2%)  (+80.6%)  (+120%)   (+97.5%)

( ) : difference to the VSM



among the methods are indicated in bold and italic fonts, respectively. In the 
parentheses, the ratio of difference to the VSM is noted. Let x and y be the 
average precision by the VSM and a method for comparison, respectively. The 
ratio is calculated by {y — x)/x. Thus a positive and a negative value indicate 
gain and loss, respectively. 

The results of the macro t-test for all pairs of methods are shown in Table 6.
The meaning of the symbols such as “≫”, “>” and “~” is summarized at the
bottom of the table. For example, the symbol “<” was obtained in the case of
DD compared to the VSM for the MED collection. This indicates that, at the
significance level $\alpha = 0.05$, the null hypothesis “DD performs
equivalently to the VSM” is rejected and the alternative hypothesis “DD performs
worse than the VSM” is accepted. At $\alpha = 0.01$, however, the null
hypothesis cannot be rejected. Roughly speaking, “A ≫ (≪) B”, “A > (<) B” and
“A ~ B” indicate that “A is almost guaranteed to be better (worse) than B”, “A
is likely to be better (worse) than B” and “A is equivalent to B”, respectively.

The results shown in Tables 5 and 6 can be summarized as follows: 
Table 6. Results of the macro t-test.

methods                              CR                     FR
A  -  B        MED    CRAN    title  desc  narr     title  desc  narr
DD  - VSM       <      <        ~     ~     ~         >     >     >
DD  - PF        <      ≪        ~     <     ~         >     >     >
DD  - LSI       <      <        ~     ~     ~         >     >     >
PF  - VSM       >      >        >     ~     ~         ~     >     ~
PF  - LSI       <      ~        >     ~     ~         >     >     >
LSI - VSM       >      >        ~     ~     ~         <     <     <

≫, ≪ : P-value ≤ 0.01
>, < : 0.01 < P-value ≤ 0.05
~    : 0.05 < P-value



— For the collections of short documents (MED and CRAN), the methods PF 
and LSI outperformed the VSM and DD. 

— For the collection CR which includes long documents, the methods mostly 
performed equivalently. The exception was the performance of PF. As shown 
in Table 6, PF was better than the VSM and LSI for the shortest queries 
(title) as well as DD for the middle length queries (desc). 

Note that methods are found to be equivalent by the statistical test even
though the ratios of the difference of the average precision are bigger than
those for MED and CRAN. For example, PF outperformed the VSM for
MED and CRAN with the ratios +20.8% and +12.2%, while DD was equivalent
to the VSM for CR with the ratio +29.9% (cf. Table 5). This is
because, in the statistical test, not only the average precision but also its
variance and the number of queries are taken into account.

— For the collection FR which also includes long documents, on the other hand, 
DD clearly outperformed the other methods. The advantage of PF and LSI 
for the collections of short documents did not hold here. 

From the above results, the influence of the length of documents and queries 
to the performance of the methods remains unclear. Although it has been shown 
that DD is inferior to PF and LSI for short documents, DD outperformed the 
other methods only for one of the collections which contain long documents. 
This could be because of the nature of the collections CR and FR. Although 
these collections include much longer documents than MED and CRAN, they 
also include many short documents as shown by the gap between the mean and 
the median in Table 1. 



4.4 Results for Partitioned Collections 

In order to clarify the relation between the performance and the length of docu- 
ments and queries, we partitioned each of the collections CR and FR into three 
smaller collections as follows. Documents in the collections were first split
Table 7. Statistics about the partitioned collections.

                                  CR                                      FR
                      relevant doc.          irrel.         relevant doc.          irrel.
                   short   middle     long     doc.      short   middle     long     doc.
no. of doc.          251      251      252   27,168        148      148      148   19,345
doc. len. min.        67      604    3,055       22        114    1,554    6,037        1
          max.       601    3,029  629,028  385,065      1,512    5,994  315,101  124,353
          mean       334    1,315   33,550    1,169        859    3,075   35,982    1,528
          median     303    1,078   11,236      318        835    2,886   17,037      536
no. of queries        27       30       27        -         43       44       63        -

into two disjoint sets: documents relevant to at least one query, and those ir- 
relevant to all queries. The set of relevant documents was further divided into 
three disjoint subsets of almost equal size according to the length of documents: 
short relevant documents, middle length relevant documents, and long relevant 
documents. By combining each subset with the set of irrelevant documents, we 
prepared three partitioned collections called “short”, “middle” and “long”. As 
queries for each partitioned collection, we took the queries which are relevant to 
at least one document in the partitioned collection. Since some documents are 
relevant to more than one query, the number of queries does not sum up to the 
number of queries in the original collections (cf. Table 2). The statistics about
the partitioned collections are shown in Table 7. 
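
The partitioning procedure can be sketched as follows. The input format
(a mapping from document identifiers to lengths and a mapping from queries to
their sets of relevant documents), the function name and the toy data are
assumptions made for illustration; the cut points are chosen so that the three
subsets are of almost equal size.

```python
def partition_collection(doc_lengths, qrels):
    """Build the "short", "middle" and "long" partitioned collections.

    doc_lengths: doc id -> document length.
    qrels: query id -> set of doc ids relevant to the query.
    Returns: name -> (set of documents, set of queries).
    """
    relevant = set().union(*qrels.values()) if qrels else set()
    irrelevant = set(doc_lengths) - relevant
    # Sort relevant documents by length (doc id breaks ties deterministically).
    ordered = sorted(relevant, key=lambda d: (doc_lengths[d], d))
    n = len(ordered)
    cut1, cut2 = n // 3, 2 * n // 3
    subsets = {"short": ordered[:cut1],
               "middle": ordered[cut1:cut2],
               "long": ordered[cut2:]}
    partitions = {}
    for name, docs in subsets.items():
        docs = set(docs)
        # Keep the queries with at least one relevant document in the subset.
        queries = {q for q, rel in qrels.items() if rel & docs}
        partitions[name] = (docs | irrelevant, queries)
    return partitions


# Toy data (hypothetical identifiers and lengths).
doc_lengths = {"d1": 10, "d2": 500, "d3": 9000, "d4": 40,
               "d5": 700, "d6": 12000, "d7": 300}
qrels = {"q1": {"d1", "d3"}, "q2": {"d2", "d4", "d6"}, "q3": {"d5"}}
partitions = partition_collection(doc_lengths, qrels)
total_queries = sum(len(queries) for _, queries in partitions.values())
print(partitions["short"], total_queries)
```

In the toy data, three queries give rise to six query assignments over the
three partitions, which illustrates why the numbers of queries in Table 7 do not
sum up to those in Table 2.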

Using the best parameters as shown in Table 4, we computed the average 
precision for the partitioned collections. Figure 4 illustrates the results. Each 
graph in the figure represents the results for a pair of a set of partitioned col- 
lections and a query length. The horizontal axes of the graphs indicate the par- 
titioned collections. These graphs show that the conventional methods (VSM, 
PF, LSI) performed worse as the documents became longer. On the other hand, 
DD yielded almost equivalent results for all document lengths on the CR collection,
and even better results for the FR collection as the documents became longer.

Table 8 shows the results of the statistical test for the partitioned collections. 
DD yielded significantly better results in most of the cases for the “long” parti- 
tions. These results confirm that passage-based document retrieval is better for 
longer documents, which has already been reported in the literature [5]. 

Let us now turn to the influence of the query length. Figure 5 illustrates 
the same results as in Fig. 4 but arranged in a different way. Here, each graph 
corresponds to a partitioned collection and the horizontal axes represent the 
query lengths. 

For the “short” partitioned collections, no clear relation between the effec- 
tiveness of the methods and the query length could be found. On the other hand, 
for the “middle” and “long” partitioned collections with the shortest queries (ti- 
tle), DD was always the best among the methods. For the “middle” CR collection 
with longer queries, DD performed worse than the other methods. For the “long”

[Figure: six panels (CR / title, CR / desc, CR / narr, FR / title, FR / desc,
FR / narr), each plotted over the partitioned collections short, middle, long.]

Fig. 4. Average precision for the partitioned collections (horizontal axes: document
length).

Table 8. Results of the macro t-test for the partitioned collections.

           methods            CR                     FR
           A - B       title  desc  narr      title  desc  narr
short      DD - VSM      <           <          <           <
           DD - PF /
           DD - LSI      <     <                <          rVw-
middle     DD - VSM            <     <          >     >
           DD - PF             <     <          >
           DD - LSI                             >     >
long       DD - VSM      >     >     >          >     >     >
           DD - PF       >     >                >     >     >
           DD - LSI      >     >     >          >     >     >

CR collection and “middle” FR collection, the advantage of DD shrank as the 
query length became longer. These tendencies can also be found in Table 8. 

For the “long” FR collection, the difference of the average precision between 
DD and the best among the other methods was about the same for all query 
lengths. However, there were disparities in their P-values: the P-value obtained
with the shortest queries (title) was about 10 and 100 times smaller than those
with the middle length (desc) and the longest queries (narr), respectively.

[Figure: one panel per partitioned collection (CR / short, CR / middle, CR / long,
FR / short, FR / middle, FR / long) for the methods VSM, PF, LSI and DD; vertical
axes: average precision; horizontal axes: query length (title, desc, narr).]

Fig. 5. Average precision for the partitioned collections (horizontal axes: query length).

From the results obtained from the partitioned collections, we conclude that 
passage-based document retrieval outperforms conventional methods if relatively 
lengthy documents are retrieved with short queries. An explanation for this fea- 
ture of passage-based document retrieval could be as follows. If lengthy docu- 
ments are retrieved with short queries, it becomes more essential to take into 
account the proximity of query terms, as done only by the passage-based method. 
In other words, the passage-based method is capable of distinguishing a few query 
terms which are in the same context (located close to each other in a document) 
from those occurring in different contexts (far away from each other). 
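
The role of term proximity can be illustrated with a small Python sketch. It is not the DD method evaluated in this paper: the function below is a hypothetical, much simpler stand-in that only counts how many distinct query terms fall inside a fixed-size window, and the window size of three tokens is an arbitrary assumption.

```python
def max_terms_in_window(tokens, query_terms, window):
    """Largest number of distinct query terms found in any span of `window` tokens."""
    query_terms = set(query_terms)
    best = 0
    # Slide a window of `window` tokens over the document (at least one position).
    for start in range(max(1, len(tokens) - window + 1)):
        seen = query_terms.intersection(tokens[start:start + window])
        best = max(best, len(seen))
    return best


# Two documents that both contain the two query terms exactly once.
near = "the star burns hydrogen fuel in its core".split()
far = ["hydrogen"] + ["filler"] * 20 + ["fuel"]
query = {"hydrogen", "fuel"}
```

A whole-document method sees the same two query terms in both documents, whereas the window count separates the document in which the terms share a context from the one in which they are far apart.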

5 Conclusion 

We have experimentally evaluated the effect of the length of documents and 
queries for document retrieval methods. The passage-based method which is ca- 
pable of ranking documents based on segmented passages has been compared 
with three conventional document retrieval methods. The results for a variety 
of document collections show that the passage-based method is superior to
conventional methods for longer documents with shorter queries. This feature of
passage-based retrieval is essential if we consider document retrieval as a tool
for text mining based on a user’s query, since (1) users tend to issue short queries, 
and (2) available documents are often longer than abstracts. 

In order to use passage-based document retrieval as a tool, however, the
following issues should be considered further. First, the window size appropriate for
analyzing documents should be determined automatically. Second, passage-based
document retrieval should work for short documents as well as the best
conventional method does. These issues will be a subject of our future
research.



Abstract. In this paper we describe some new results from ASTRA, a
computational research aid for the formulation and analysis of process 
explanations in nuclear astrophysics. The program generates fusion and decay 
reactions for chemical elements by using its knowledge of quantum theory, and 
from these reactions constructs all theoretically possible reaction chains as 
process explanations for the nucleosynthesis of heavier elements. Earlier 
applications of ASTRA generated reactions of the elements and isotopes from 
hydrogen to oxygen, and found novel reactions and reaction chains for these 
elements. We have recently extended the system’s knowledge base for the 
elements from oxygen to sulphur. The new applications of ASTRA generated a 
series of hydrogen burning and helium burning reactions involving heavier 
elements such as fluorine, neon, sodium, magnesium, aluminium, silicon and 
sulphur. The program also generated a complete series of carbon, nitrogen and 
oxygen burning reactions. The new results of ASTRA lead to interesting details 
about the origin of the elements between oxygen and sulphur. 



1 Introduction 

As a specialized field of research in artificial intelligence and cognitive science, the 
computational study of scientific discovery has made important advances in its short 
history. Early research in the computational study of science was mainly concerned 
with modeling discoveries from the history of physics, chemistry and biology. The 
types of discoveries also ranged widely, including numeric laws (e.g. Langley, 1981; 
Langley, Simon, Bradshaw and Zytkow, 1987), qualitative relations (e.g. Jones 1986), 
structural models (e.g. Zytkow & Simon, 1986), and process models (e.g. Kulkarni & 
Simon, 1990). Although important in understanding the conditions of the discoveries, 
these models produced results already known to the developers.

In recent years interest increased towards the computational discovery of new 
scientific knowledge by means of new models (see, Langley, 1998). Among the recent 
areas of application is the computational design and construction of chemical and 
nuclear reaction processes. Three examples of such efforts are Hendrickson’s (1995) 
SYNGEN which designs the synthesis of some organic compounds from initial and 
intermediate compounds, Valdes-Perez’s (1995) MECHEM which has found new 
reaction pathways in catalytic chemistry, and Kocabas and Langley’s (1998; 2000) 
ASTRA system which has found new reactions and pathways in nuclear astrophysics.
There is other work, such as that described by Lee, Buchanan, Mattison, Klopman, and
Rosenkranz (1995), reporting novel results on whether chemicals cause cancer, and by
Mitchell, Sleeman, Duffy, Ingram and Young (1997), whose system DAVICCAND
has found a new numeric relation in metallurgy.

This paper focuses on some of the new results of ASTRA, which has been designed 
to support scientists in explaining fusion processes, the nucleosynthesis of elements 
and their relative abundance in stars. The program is a successor of BR-4 (Kocabas & 
Langley, 1995) which was developed as an integrated model for studying the role of 
predictions in particle physics, which in turn, was a successor of BR-3 (Kocabas, 
1991). 

In previous runs, ASTRA was given information about elements and isotopes from 
hydrogen to oxygen, and the program had generated the reactions and reaction 
networks for these isotopes. The formation of elements in this range has been 
extensively studied in nuclear astrophysics. Despite this, the program generated 
several new reactions and processes of interest to astrophysicists. Recently, we 
decided to extend the scope of the program to include elements and isotopes from 
oxygen to sulphur, to see if the program will produce interesting results for the 
elements in this range. The focus here will be on the new results of ASTRA with an 
emphasis on the system’s abilities as a research tool in astrophysics, rather than its 
behavior which was described in detail elsewhere (see, Kocabas & Langley, 1998; 
2000). 

In the next section, we summarize the research topics and methods in nuclear 
astrophysics, the area of application of ASTRA. Section 3 describes ASTRA in terms 
of its inputs, outputs, constraints and operations. Section 4 describes the new 
experimental results of ASTRA, Section 5 discusses these results, and Section 6 
discusses related research. The paper ends with a summary of the conclusions.



2 The Domain of Nuclear Astrophysics 

Nuclear astrophysics is a branch of astrophysics that is mainly concerned with the
formation of heavier elements from hydrogen (H) and helium (He), through a series
of fusion and decay processes in stars. Another important concern is the irregularity in
the relative abundances of elements, in particular the abundance of carbon (C), nitrogen
(N) and oxygen (O) compared to lighter elements like lithium (Li), beryllium (Be)
and boron (B). Exploration of the processes in which the heavier elements from
oxygen (O) to iron (Fe) are formed is yet another main topic in this field.

According to the current astrophysical theories, stars go through several stages in 
their lifetimes. The first stage involves ‘hydrogen burning’ in which hydrogen is 
transformed into helium. Astrophysicists propose several different pathways 
(Audouze & Vauclair, 1980, p. 52; Williams, 1991, p. 351) to account for hydrogen 
burning in stars. Later stages involve more complex reactions and processes such as 
helium burning, and carbon, nitrogen and oxygen burning. 

Astrophysicists explain nucleosynthesis by first selecting a stellar model in
thermal equilibrium which makes certain assumptions about the mass, temperature,
density, and the element distribution in the stellar plasma. Then they formulate the
possible and most likely reactions by using several quantum constraints and rate
calculations. They then use the reactions with high rates to construct sets of reaction 
pathways which they call ‘reaction networks’. 

In our previous work (Kocabas & Langley, 1998; 2000) we examined the results of
ASTRA on several research topics concerning the formation of the lighter elements
from hydrogen to oxygen. These were: 1) hydrogen burning processes, 2) helium
burning processes, 3) formation of carbon, nitrogen and oxygen through hydrogen and
helium burning, and other fusion chains, 4) the role of neutrons in such processes, and
5) the anomaly in the relative abundance of the light elements.

In evaluating the results of ASTRA we examined a number of books and journal
papers on nuclear astrophysics, notably the following work: Audouze & Vauclair
(1980); Clayton (1983); Fowler (1986); Fowler et al. (1967); Fowler et al. (1975);
Harris & Fowler et al. (1983); Cujec & Fowler (1980); Kippenhahn & Weigert (1994);
Lang (1974); Williams (1991); and Adelberger et al. (1998).

3 System Description of ASTRA 

Before we describe our application of ASTRA to nuclear astrophysics with some of 
the earlier and the new results, we briefly describe its inputs, outputs and procedures. 
A more detailed description can be found in Kocabas and Langley (1998). The 
program operates in two stages: the first generates all theoretically valid reactions, and 
the second produces reaction chains as process explanations for the nucleosynthesis of 
elements. 

3.1 Generating Reactions 

The first stage of ASTRA takes as input descriptions for a set of elements and 
isotopes. The current version includes information about 68 such entities. Each entity 
is characterized in terms of five quantum properties: rest mass (in MeV/c²), electric
charge, spin counts, lepton counts, and baryon counts. ASTRA also has the related 
rules concerning the conservation of these quantum properties in the reactions. 

Using this information, ASTRA generates all collision and decay reactions among
these elements that obey the conservation laws, together with their energy emissions,
or Q-values, in terms of mega electron volts (MeV). The reactions generated by the
program are in the form: R_m -> P_n, m = 1,2,3; n = 1,2,3, where R_m and P_n are the
sets of the reacting and resulting elements respectively, and m and n are the number of
elements in the sets. (For m = 1, m = 2 and m = 3 the formula represents decays, and
double and triple collision reactions respectively.) An example of the output of this
module for the fusion reactions of hydrogen (H) with sodium (²³Na) is as follows:*

H + ²³Na -> ²⁴Mg + 11.68 (MeV)

H + ²³Na -> ²⁴Na + nu + 6.18

H + ²³Na -> ²⁰Ne + ⁴He + 2.38 .



* The reaction formulations of ASTRA are based on neutral atoms. For this reason, there
appear minor differences with textbook notations, such as in the second reaction above
whose textbook version is H + ²³Na -> ²⁴Na + /e + nu, instead of H + ²³Na -> ²⁴Na + nu.

In each example, hydrogen and sodium (on the left hand side) combine to form one or
more new substances (on the right hand side), along with the total energy emissions in
MeV.
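
The conservation checks and the Q-values quoted above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions and not ASTRA's implementation: the `Entity` record and the function names are hypothetical, the spin constraint is left out, and the masses are rounded standard atomic masses supplied here for illustration rather than the values in ASTRA's knowledge base. With neutral atoms the lepton count of an atom is taken to be its number of electrons.

```python
from dataclasses import dataclass

U_TO_MEV = 931.494  # energy equivalent of one atomic mass unit, in MeV


@dataclass(frozen=True)
class Entity:
    name: str
    mass_u: float    # neutral-atom mass in atomic mass units (assumed input)
    baryons: int     # baryon count (mass number)
    leptons: int     # lepton count (electrons of the neutral atom, 1 for nu)
    charge: int = 0  # neutral atoms and neutrinos carry no net charge


H = Entity("H", 1.00782503, 1, 1)
HE4 = Entity("4He", 4.00260325, 4, 2)
NE20 = Entity("20Ne", 19.99244018, 20, 10)
NA23 = Entity("23Na", 22.98976928, 23, 11)
NA24 = Entity("24Na", 23.99096295, 24, 11)
MG24 = Entity("24Mg", 23.98504170, 24, 12)
NU = Entity("nu", 0.0, 0, 1)


def conserves(reactants, products):
    """True if baryon count, lepton count and charge balance on both sides."""
    for prop in ("baryons", "leptons", "charge"):
        if sum(getattr(e, prop) for e in reactants) != sum(getattr(e, prop) for e in products):
            return False
    return True


def q_value(reactants, products):
    """Energy released in MeV: mass of the reactants minus mass of the products."""
    return (sum(e.mass_u for e in reactants) - sum(e.mass_u for e in products)) * U_TO_MEV
```

Under these assumptions the three sodium reactions pass the checks and give Q-values close to the 11.68, 6.18 and 2.38 MeV listed above, while H + ²³Na -> ²⁴Na without the neutrino is rejected because the lepton counts do not balance.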

For the runs described in this paper, we provided ASTRA with information about 
the elements from hydrogen to sulphur, their isotopes and a few elementary particles 
like the electron, proton, neutron and the neutrino with their antiparticles, giving a 
total of 68 distinct entities. From these, the system generated more than 600 different 
reactions. We manually eliminated minor variations such as ^He + ^Be —> + e + 

/e and ^He + + nu + /nu, leaving 472 reactions that included 344 fusion 

reactions and 28 decays. 

3.2 Generating Reaction Chains 

Taking as input the reactions generated by the first stage, ASTRA generates the 
reaction chains for an element E from a small set of basic elements/isotopes (E) that 
we assume as given. The system uses a depth-first, backward chaining search to 
construct the reaction chains. On the first step, ASTRA finds those reactions that give 
as an output the final element E. Upon selecting one of these reactions, R, it 
recursively finds those reactions that give as an output one or more of R’s input
elements. The algorithm continues this process, halting its recursion when it finds a 
reaction chain for which all the reacting elements are in (E), or when it cannot find a 
reaction off which to chain. ASTRA generates all possible reaction chains in this 
systematic manner. 
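
The search described above can be sketched as a recursive generator in Python. This is an illustration under stated assumptions, not ASTRA's code: reactions are represented as hypothetical (inputs, outputs) pairs of entity names, and the guard that stops a reaction from being reused on the same search path is our own addition to guarantee termination. The example data is the textbook pp-I chain written in neutral-atom notation.

```python
def reaction_chains(target, reactions, basic, _path=()):
    """Yield chains (tuples of reactions) that build `target` from `basic`.

    reactions: list of (inputs, outputs) pairs, each a tuple of entity names.
    basic: set of entity names assumed to be given.
    """
    if target in basic:
        yield ()  # nothing to explain: the entity is available from the start
        return
    for reaction in reactions:
        inputs, outputs = reaction
        if target not in outputs or reaction in _path:
            continue  # does not produce the target, or already on this path
        # Depth-first: explain every input of this reaction in turn.
        partial = [()]
        for needed in inputs:
            sub = list(reaction_chains(needed, reactions, basic, _path + (reaction,)))
            partial = [p + s for p in partial for s in sub]
            if not partial:
                break  # no reaction off which to chain for this input
        for chain in partial:
            yield chain + (reaction,)


# Textbook pp-I chain in neutral-atom notation (illustrative data).
r1 = (("H", "H"), ("D", "nu"))
r2 = (("D", "H"), ("3He",))
r3 = (("3He", "3He"), ("4He", "H", "H"))
chains = list(reaction_chains("4He", [r1, r2, r3], {"H"}))
```

Because each of the two ³He inputs of the last step needs its own explanation, the single chain found here lists the first two reactions twice before the final step.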

4 New Results of ASTRA 

In this section we report the new results of our tests with ASTRA concerning 
hydrogen-, helium-, carbon- and oxygen-burning reactions. We start with proton, 
electron and neutron capture reactions of heavier elements such as oxygen, fluorine,
neon, sodium, magnesium, aluminium, silicon and phosphorus. 

4.1 Proton, Electron, and Neutron Captures

Proton captures are an important class of exothermic reactions that also take part in 
processes transforming hydrogen into helium as will be described below. Proton 
capture by an atomic nucleus turns it into another element with one higher atomic 
number. ASTRA finds 33 examples of proton captures given in astrophysics
literature (e.g., Fowler et al., 1967, 1975, 1983) for elements from hydrogen to
oxygen, and 20 more for elements from oxygen to sulphur.

ASTRA’s first stage predicts that all elements from hydrogen to sulphur, with
the exception of ⁴He, participate in exothermic proton capture. The program
produces 46 such reactions for elements from hydrogen to oxygen, including all 33
examples we have found in texts, but also 13 others which we have not seen in
the astrophysics texts that we examined. The program also finds 72 proton captures for
elements from oxygen to sulphur, including the 20 such reactions cited in
the same literature. Three examples of such proton captures are:

H + ¹⁹F -> ²⁰Ne

H + ²³Na -> ²⁴Mg

H + ²⁷Al -> ²⁸Si.

In these reactions, proton captures by fluorine, sodium and aluminium transform
them into neon, magnesium and silicon, respectively.

Also, all the isotopes from oxygen to sulphur, with the exception of the isotopes of
neon and magnesium, participate in exothermic proton captures that produce helium
(⁴He). Three examples of such reactions are:

H + -> ^He + 

H + ²³Na -> ⁴He + ²⁰Ne

H + ²⁷Al -> ⁴He + ²⁴Mg.



Electron capture reactions are weak interactions in which an electron is absorbed by
the atomic nucleus, which is thereby transformed into one with a smaller atomic number. In the
process, the electron is combined with a proton in the nucleus, effectively
transforming it into a neutron with the emission of a neutrino:

e + p --> n + nu.

ASTRA’s first stage produces 6 electron capture reactions for elements from
hydrogen to oxygen, of which only the one just given appears in astrophysics texts.
The program also found 8 electron capture reactions for elements from oxygen to
sulphur, none of which we have seen in the texts.

In neutron capture, an element combines with a neutron to form a heavier isotope of
the same element. We found 17 neutron captures for elements from hydrogen to
oxygen in the literature, while ASTRA predicts 59 such reactions that are theoretically
possible for the same elements. Some examples of these reactions can be found in
Kocabas and Langley (1998). Recent runs of the system generated 76 reactions for
elements from oxygen to sulphur. Three examples of such neutron capture reactions
are:



n + ¹⁸F -> ¹⁹F

n +  -> 

n +  -> 

Here, as indicated above, in each case the nucleus that absorbs the neutron turns into a
heavier isotope of the same element.

4.2 Hydrogen Burning Processes

The transformation of hydrogen into helium in a series of nuclear processes which
take place in main sequence stars is the principal source of energy. The standard
reaction chains given in astrophysics texts (e.g. Audouze & Vauclair, 1980, p. 52;
Williams, 1991, p. 351) for helium synthesis in such stars are the hydrogen-burning
processes called “proton-proton” or pp chains. Other hydrogen burning reactions that
appear in texts involve the heavier elements carbon, nitrogen and oxygen, and the
pathway is called the CNO-chain. ASTRA produces all known CNO-chains, in
addition to one viable variant using an electron capture (see Kocabas &
Langley, 1998).

We have tested ASTRA on hydrogen burning reactions involving the elements 
heavier than oxygen. Such reactions are hypothesized to occur in stars several times 
larger than the sun. The program found four hydrogen burning chains involving the 
elements fluorine, neon, sodium, magnesium, silicon, phosphorus and sulphur. One of 
these processes is 

H + -> + nu 

H + ^^Mg -> 

H + -> 

+ e -> ^^Al + e + nu 
H + ²⁷Al -> ²⁴Mg + ⁴He

4 H -> ⁴He + 2 nu .

In this process four hydrogen atoms in effect, transform into one helium atom, while 
two neutrinos are also emitted. We did not see any of these processes in the texts that 
we examined, but we presume that they are known to astrophysicists. 
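
The bookkeeping behind such a net reaction can be sketched in Python. The example is only an illustration: the function name is hypothetical, and the chain used is the textbook pp-I chain in neutral-atom notation, not the chain listed above, whose intermediate isotopes are not all legible in our copy.

```python
from collections import Counter


def net_reaction(chain):
    """Sum a chain of (inputs, outputs) steps and cancel intermediate species."""
    consumed, produced = Counter(), Counter()
    for inputs, outputs in chain:
        consumed.update(inputs)
        produced.update(outputs)
    # Counter subtraction keeps only positive counts, so species that are
    # produced and consumed in equal numbers drop out on both sides.
    return dict(consumed - produced), dict(produced - consumed)


# pp-I chain: the first two steps run twice to supply both 3He nuclei.
step1 = (("H", "H"), ("D", "nu"))
step2 = (("D", "H"), ("3He",))
step3 = (("3He", "3He"), ("4He", "H", "H"))
net_in, net_out = net_reaction([step1, step2, step1, step2, step3])
```

The deuterium and ³He intermediates cancel, leaving four hydrogen atoms on the left and one ⁴He atom plus two neutrinos on the right, the same net form as 4 H -> ⁴He + 2 nu.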

4.3 Helium Burning Processes 

The origin and the relative abundance of carbon and oxygen has been one of the main 
concerns of astrophysics. The standard account (e.g., Fowler, 1986, pp. 5-6) relies on 
the process of helium-burning, in which helium nuclei react to form carbon and 
oxygen in the following steps:

⁴He + ⁴He -> ⁸Be

⁴He + ⁸Be -> ¹²C

⁴He + ¹²C -> ¹⁶O .

In its earlier runs, ASTRA found an alternative to this process which astrophysicists 
qualified as more likely in neutron-rich stellar media (see, Kocabas & Langley, 2000). 

ASTRA finds 25 exothermic helium burning reactions involving the range of 
elements from oxygen to silicon, including the 16 such reactions cited in the texts. 



Some of these reactions are:

He + O -> Ne             5.16 (MeV)
⁴He + ¹⁹F -> ²³Na         10.5
⁴He + ¹⁹F -> H + ²²Ne     1.72
⁴He + ²⁰Ne -> ²⁴Mg        9.3
⁴He + ²²Ne -> ²⁶Mg        10.6
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‘*He + 
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^^Al 
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‘*He + 
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^''Si 
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9.7 


H- ^^Al -> 
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31s 


4.2 


H- ^^Al -> 
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r , 50p. 

+ Si 


2.42 


^He + 


28 p • 32 p 

Si -> s 




6.9 


‘*He + 


28c’ ^ 

Si -> nu + 


32p 


4.6 


‘*He + 


29p. ^ 33c 

Si -> s 




7.2 



Among these reactions those that emit neutrinos (nu and /nu) are weak interactions
which are much slower than the other alpha capture reactions. Astrophysicists 
generally ignore the weak reactions for their slow rates, except in processes that rely 
on such weak reactions. 

A careful comparison of the proton capture, neutron capture and helium burning
reactions produced by ASTRA with the natural abundances of the elements from
oxygen to sulphur in the CRC Handbook (80th ed., D.R. Lide, 1999-2000) reveals an
interesting result: the elements fluorine, neon, sodium, magnesium, silicon,
phosphorus and sulphur in the solar system must have been formed by alpha capture
processes, rather than proton or neutron captures. This is because the stable isotope
abundances of these elements indicate a parallelism with the stepwise alpha-capture
(helium burning) of the stable lighter isotopes of the elements in the series (see Table
1). Indeed, the two alpha capture chains (²⁰Ne, ²⁴Mg, ²⁸Si, and ²³Na, ²⁷Al,
³¹P) contain the most abundant isotopes of these elements. These processes may have
been accompanied by carbon, nitrogen and oxygen burning processes which produce
²⁴Mg, ²⁸Si and ³²S respectively as shown in the next subsection.

Although proton capture reactions explain the relative abundance of ¹⁹F, ²⁰Ne, ²³Na,
²⁴Mg, ²⁷Al and ²⁸Si, they fail to explain the relative abundance of ³¹P. Similarly,
neutron capture reactions fail to explain the relative abundances of ²⁰Ne, ²⁴Mg and
²⁸Si. Yet, stepwise alpha capture explains the relative abundances of all the isotopes
in the series. We are currently investigating the astrophysical literature on the origins
of the elements from fluorine to sulphur before claiming any novelty on this issue.

Table 1. Relative abundances of some isotopes for elements from oxygen to sulphur.

isotope   % abundance      isotope   % abundance
¹⁶O       99.76            ¹⁸O       0.2
¹⁹F       100              ¹⁸F       0
²⁰Ne      90.48            ²²Ne      9.25
²³Na      100              ²²Na      0
²⁴Mg      78.99            ²⁶Mg      11.01
²⁷Al      100              ²⁶Al      0
²⁸Si      92.23            ²⁹Si      4.67
³¹P       100              ³⁰P       0
³²S       95.0             ³⁴S       4.21
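
The parallelism between stepwise alpha capture and the most abundant isotopes can be checked with a few lines of Python. This is a sketch for illustration only: the function name and the element lookup table are hypothetical, and the set of most abundant isotopes is simply read off Table 1.

```python
SYMBOLS = {8: "O", 9: "F", 10: "Ne", 11: "Na", 12: "Mg",
           13: "Al", 14: "Si", 15: "P", 16: "S"}

# Most abundant isotope of each element from oxygen to sulphur (Table 1).
MOST_ABUNDANT = {"16O", "19F", "20Ne", "23Na", "24Mg", "27Al", "28Si", "31P", "32S"}


def alpha_chain(z, a, max_z=16):
    """Isotopes reached from (z, a) by repeated capture of a 4He nucleus."""
    chain = []
    while z <= max_z:
        chain.append(f"{a}{SYMBOLS[z]}")
        z, a = z + 2, a + 4  # each capture adds two protons and two neutrons
    return chain
```

Starting from ¹⁶O and from ¹⁹F, the two chains together reproduce every entry in the left-hand column of Table 1, which is the observation made in the text.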



4.4 Carbon, Nitrogen, and Oxygen Burning 

Carbon burning, in which two carbon atoms fuse together to produce heavier 
elements, takes place after the helium burning stage in a star. ASTRA finds four 
carbon burning reactions which produce the elements neon, sodium, and magnesium:

¹²C + ¹²C -> ²⁴Mg         + 14.4 (MeV)
¹²C + ¹²C -> nu + ²⁴Na    + 8.9
¹²C + ¹²C -> H + ²³Na     + 2.72
¹²C + ¹²C -> ⁴He + ²⁰Ne   + 5.1



In nitrogen burning, two nitrogen atoms fuse together to form elements ranging from 
oxygen to silicon. ASTRA finds 10 such reactions:

¹⁴N + ¹⁴N -> ²⁸Si          + 27.82 (MeV)
¹⁴N + ¹⁴N -> nu + ²⁸Al     + 23.12
¹⁴N + ¹⁴N -> n + ²⁷Si      + 10.65
¹⁴N + ¹⁴N -> H + ²⁷Al      + 16.24
¹⁴N + ¹⁴N -> D + ²⁶Al      + 5.52
¹⁴N + ¹⁴N -> ³He + ²⁵Mg    + 4.52
¹⁴N + ¹⁴N -> ⁴He + ²⁴Mg    + 17.72
¹⁴N + ¹⁴N -> ⁸Be + ²⁰Ne    + 8.32
¹⁴N + ¹⁴N -> ¹²C + ¹⁶O     + 10.46
¹⁴N + ¹⁴N ->               + 0.72



Finally, ASTRA formulates the following oxygen burning reactions in which two 
oxygen atoms fuse together in exothermic reactions, and the elements magnesium, 
silicon, phosphorus and sulphur are generated:

¹⁶O + ¹⁶O -> ³²S           + 17.12 (MeV)
¹⁶O + ¹⁶O -> nu + ³²P      + 14.82
¹⁶O + ¹⁶O -> n + ³¹S       + 2.05
¹⁶O + ¹⁶O -> H + ³¹P       + 8.34
¹⁶O + ¹⁶O -> ⁴He + ²⁸Si    + 10.22
¹⁶O + ¹⁶O ->               + 0.02



Carbon, nitrogen and oxygen burning reactions happen only in massive stars as they 
require higher energies to initiate. The astrophysics texts that we examined mention 
only a few of these reactions, such as -> -> ^^Si, and + 

-> while ASTRA provides a full account of such reactions. 



5 Discussion of Results 



We have compared ASTRA’s earlier outputs involving the elements from hydrogen to 
oxygen to those available in astrophysics texts (Clayton, 1983; Audouze & Vauclair,
1980; Kippenhahn & Weigert, 1994; Fowler et al., 1967, 1975, 1983; Cujec & Fowler,
1980; Adelberger et al., 1998), and discussed some of its results with
astrophysicists. We received encouraging comments from domain experts on the 
earlier outputs (see, Kocabas & Langley, 2000). However, the reactions and processes 
of the light elements have already been studied extensively by nuclear astrophysicists. 
For this reason, we decided to extend the scope of the program to investigate the 
reactions and the processes of the elements from oxygen to sulphur. 

The ASTRA program can handle a very large volume of data for constructing 
reactions and reaction networks. Astrophysicists normally formulate the reactions by 
hand, and construct the reaction networks by focusing on the more likely reactions by 
using certain domain criteria. It is in this way the hydrogen and helium burning 
processes involving the lighter elements have been dealt with extensively in the 
current literature. But as the number of possible reactions increases rapidly for the
heavier elements, a complete analysis of the reactions and processes can only be
carried out with the aid of a computational tool such as our program. Although we
tested ASTRA on the reactions of the elements from hydrogen to sulphur
with some interesting results, we plan to extend the system for exploring the reactions
of heavier elements from sulphur to iron and further, which take place in stellar
and interstellar processes.

The understanding of the nuclear processes in which the chemical elements are 
formed is important in more ways than one, as this provides detailed information 
about the stellar and interstellar conditions that produced these elements. This is why
cosmologists and astronomers, as well as nuclear astrophysicists, are very much
interested in these processes. We have described in Section 4 how an analysis of the
reactions and the reaction processes produced by ASTRA, together with the natural abundances
of chemical elements and isotopes, can lead to a detailed picture of the conditions in
which these elements are formed.

Astrophysicists use reaction rates to rule out slower reactions from their reaction 
networks. The current version of ASTRA can use reaction rates to rule out candidates, 
retaining only those reactions with the highest rates to construct reaction networks. 
But the rate for each reaction must be given by the user; the program cannot calculate
them. We attempted to incorporate the rate calculation in ASTRA recently but decided
not to go on with this, because of the complexities involved. Rate calculations are
based on the reaction cross-sections and element concentrations in stellar media.
Astrophysicists first construct a model of the star by making a number of assumptions
about the star size, mass, temperature, pressure and element distribution. Stellar
plasmas are also treated in several layers through which element compositions,
dominant reactions and processes change.

Although ASTRA searches a much larger space of reactions and processes than
human scientists can, we did not meet any problems with it for the elements and
isotopes from hydrogen to sulphur involving 68 distinct entities. We have yet to see if 
we will need to constrain the scope of the reactions for the elements from hydrogen to 
iron. We plan to extend the program to investigate the reactions of the elements from 
sulphur to iron. Meanwhile we will continue to investigate the literature about the 
origins of heavier elements in the solar system. 

6 Related Research 

The ASTRA system has evolved from our previous work in computational study of 
discoveries in particle physics with BR-4 (Kocabas & Langley, 1995), which models 
the discoveries in this field by prediction and theory revision. BR-4 inherits some of 
its capabilities from its predecessor BR-3 (Kocabas, 1991), which in turn descends 
from STAHL (Zytkow & Simon, 1986), and STAHLp (Rose & Langley, 1986) which 
modelled qualitative discovery in chemistry. 

Our system shares goals and techniques with more recent systems MECHEM 
(Valdes-Perez, 1995) designed to discover new reaction mechanisms in catalytic 
chemistry, and SYNGEN (Hendrickson, 1995) which constructs pathways for the 
synthesis of complex organic chemicals from simpler constituents. There are many 
similarities between ASTRA and MECHEM in terms of the tasks they perform. Both 
systems produce reactions and reaction mechanisms in large search spaces, and both 
are designed as computational aids for scientists. But the two systems differ in their 
inputs and outputs. MECHEM receives as input the initial and final chemical 
substances and generates all the simple reaction pathways using a set of constraints on 
chemical reactivity. Similarly, ASTRA uses a set of quantum constraints to formulate 
the reactions from which it constructs the reaction links for each element until the 
final element is reached. The reaction links in a chain constitute what is called by 
astrophysicists ‘the reaction network’. 

ASTRA has to deal with a large number of entities (elementary particles, elements
and their isotopes), and an even larger number of reactions of these entities, to
construct valid reactions and reaction chains, while MECHEM has a relatively smaller
search space in its domain of application. MECHEM’s reaction pathways are lists of
reaction steps normally with at most two reactants and two products. In contrast, the
reactions of ASTRA can have from one to three entities on both sides.

As to the comparison between our system and SYNGEN, the latter addresses the 
synthesis of organic chemicals, where one needs to determine reaction paths and the 
initial substances, through a set of known intermediate substances. The constraints of 
SYNGEN are more similar to those used by MECHEM though they operate in 
different fields of chemistry. Our program differs from these systems in its field of 
application and the types of constraints used. 
7 Conclusions 

In this paper we described the new results of ASTRA, a computational tool which 
formulates reactions and reaction chains for researchers in nuclear astrophysics. The 
system determines all valid reactions for a given set of elements, isotopes and particles 
using a set of quantum constraints. The system also generates all reaction pathways 
for an element starting from a set of lighter elements. ASTRA generates all reactions 
we have seen in the astrophysics literature involving proton, electron and neutron 
captures, and helium, carbon, nitrogen and oxygen burning. ASTRA also reproduces 
all reaction chains that scientists have proposed for the synthesis of helium, carbon, 
nitrogen and oxygen in stellar media. But many of the valid reactions and reaction 
chains that the system generates do not appear in the related scientific literature. The 
domain experts that we have contacted suggested that some of these results carry 
theoretical interest for certain stellar models, but the vast majority of the reaction 
chains would be ignored by astrophysicists for their low rates. 

Earlier we decided to incorporate the rate calculations in the ASTRA system, but 
later abandoned this project because of the complexities involved. Instead, we focused 
on extending the system’s knowledge base to investigate the reactions and processes 
of the heavier elements. Given information about 32 more elements and isotopes from 
oxygen to sulphur, amounting to a total of 68 distinct entities, the program generated 
all the proton, electron, neutron capture reactions and all the helium, carbon, nitrogen 
burning reactions. A close comparison of these reactions with the stability and natural 
abundances of the 32 isotopes between oxygen and sulphur indicated that the stable 
isotopes in this range must have been formed by exothermic alpha capture reactions 
accompanied by carbon, nitrogen and oxygen burning rather than proton or neutron 
capture reactions. We are currently investigating the literature for any scientific record 
on this issue. 
Abstract. In this paper we describe BR-4, a computational model of 
scientific discovery in particle physics. The system incorporates oper- 
ators for determining quantum values of known particles, formulating 
new quantum properties, positing new particles, and predicting reactions 
among particles. BR-4 carries out heuristic search guided by constraints 
that its theory be consistent and complete with respect to observed re- 
actions. We show that this control scheme is sufficient to model, with 
some manual intervention, an extended period in the history of particle 
physics, including the discovery of the neutrino and the postulation of 
baryon, lepton, and electron numbers. In closing, we compare BR-4 to 
other discovery systems and suggest directions for future research. 



1 Introduction and Motivation 

Computational research on scientific discovery falls into two broad categories. 
The first, typified by the work of Langley, Simon, Bradshaw, and Zytkow (1987), 
focuses on modeling the processes responsible for discoveries from the history of 
science. The second approach, exemplified by the work of Valdes-Perez (1995) 
and Mitchell, Sleeman, Duffy, Ingram, and Young (1997), uses computational 
methods to discover new scientific knowledge. These two approaches share many 
ideas, and both have made valuable contributions to discovery science, but they 
have distinct goals and criteria for evaluation. 

In this paper we describe results within the first, historical, approach to 
scientific discovery. Like Nordhausen and Langley (1993), we believe that there 
has been important progress in this area, but that most previous models have 
focused on one aspect of the scientific process to the exclusion of others. Like 
them, our goal has been to extend earlier models to account for a broader range 
of scientific enquiry during an extended period in science. We have not tried to 
model the processes in detail or to craft a precise theory of human cognition, but 
rather to provide an abstract but unified account of major activities and their 
order of occurrence. This has required us to develop an integrated framework 
that combines discovery mechanisms in a coherent way. 
Nordhausen and Langley’s work addressed empirical discovery in physics 
and chemistry, which led their IDS system to integrate mechanisms for forming 
taxonomies, finding qualitative laws, and detecting numeric relations. We have 
focused instead on the more theory-laden domain of particle physics, so that 
our BR-4 system integrates processes for constructing and revising structural 
theories, detecting and formulating problems, generating new theoretical terms, 
and predicting new events. 

In the next section we present our integrated framework for scientific dis- 
covery and its implementation in BR-4. After this, we consider four examples 
from the history of particle physics, showing for each how the system simulates 
discoveries made during the period. These case studies include the postulation 
of the neutrino, the prediction of various reactions, the proposal of baryon and 
lepton numbers, and the discovery of electron and muon numbers. In closing, 
we review related computational work on discovery and consider directions for 
extending our framework. 

2 A Framework for Discovery in Particle Physics 

In this section we present a computational framework for explaining the processes 
that support scientific discovery in particle physics, starting with an analysis 
of the task. We then turn to the representational assumptions that underlie 
our framework, the heuristics that drive the discovery process, and the search 
algorithm that our model, BR-4, uses to explore the space of theories. 

2.1 The Discovery Task 

Particle physics studies the nature of elementary particles (the building blocks 
of matter) and interactions among these entities. The basic phenomena in this 
field take the form of reactions, similar in many ways to those found in chemistry. 
For instance, one such ‘observed’ reaction (typically inferred from tracks in cloud 
chambers) is p + p → p + n + π, where the symbols p, n, and π represent the 
proton, neutron, and pion particles, respectively. 

As in chemistry, physicists require that reactions among elementary parti- 
cles obey certain conservation laws. One of the main tasks in particle physics 
concerns the assignment of values for quantum properties such that observed re- 
actions conserve those properties. For example, the above reaction conserves the 
quantum property of electric charge, provided we assign the accepted charges 1 
to p, 0 to n, and 1 to π. Other assignments are possible for this reaction, but 
they would not work for other particles and their observed interactions. 
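
The conservation check described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names and the dictionary encoding are hypothetical, and the property is treated as simply additive across each side of a reaction.

```python
# Charges as given in the text for the proton, neutron, and pion.
CHARGE = {"p": 1, "n": 0, "pi": 1}

def total(side, values):
    """Sum a quantum property over one side of a reaction."""
    return sum(values[particle] for particle in side)

def conserves(reaction, values):
    """Return True if the property has the same total on both sides."""
    inputs, outputs = reaction
    return total(inputs, values) == total(outputs, values)

# p + p -> p + n + pi, the example reaction from the text.
reaction = (["p", "p"], ["p", "n", "pi"])
print(conserves(reaction, CHARGE))  # True: 1 + 1 == 1 + 0 + 1
```

A reaction such as p → n would fail the same check, since the charge totals 1 and 0 differ.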

The notion of conservation also explains why some particle reactions are never 
observed. For example, proton decay, as in the reaction p → e + γ, has never 
been seen, despite its conservation of electric charge. However, one can explain 
its absence by positing that it fails to conserve another quantum property, the 
baryon number. Thus, another central task in particle physics involves explaining 
missing reactions by postulating new quantum properties. 
Other activities include the inference of new particles, either on theoretical 
or empirical grounds, and the prediction of reactions that involve these particles 
in ways that satisfy known conservation laws. Testing such predictions leads 
into the realm of experimental particle physics, which we will not address here. 
But the above pursuits cover a wide range of the behaviors that occur in this 
scientific field. 

2.2 Discovery Operators and Internal Representation 

The above analysis of the discovery task suggests that four basic operators play 
a central role in particle physics. First, for a given set of particles, quantum 
numbers, and observed reactions, we must be able to determine a set of quantum 
values that satisfy conservation for those reactions. Second, we must be able 
to posit new quantum properties that account for the absence of unobserved 
reactions. Third, we require an operator that posits new particles and their role 
in known reactions. Finally, we need some mechanism for predicting reactions 
that have not yet been observed, but that follow from the current theory. We 
have incorporated these operators into the BR-4 model, where they support the 
process of theory formation and revision. 

Operators of this sort must alter some internal representation that contains 
hypotheses about the particles, properties, and reactions that exist, and that 
also indicates specific quantum values for each pair of property and particle. This 
representation can take many forms, but, following Valdes-Perez, Zytkow, and 
Simon (1993), we can view it as two related matrices. One matrix lists particles 
against quantum properties, with each matrix entry specifying the value for a 
specific particle on a specific property. The other matrix lists particles against 
reactions, with an entry containing the total number of times the particle occurs 
in the reaction. Our operator for determining quantum values alters entries in 
the first matrix, whereas the other operators each extend one or both matrices 
along one of their dimensions. In our examples, we will use the matrix notation 
to specify the properties of particles but not the reactions in which they occur, 
since the latter matrix would be largely empty. 
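
One possible encoding of the two matrices is sketched below. The layout is our assumption rather than BR-4's actual data structure: in particular, we store signed net counts (outputs minus inputs) in the reaction matrix so that conservation reduces to a weighted sum, whereas the text only says that an entry records how often a particle occurs. The positron, the charge +1 row with mass 0.51 in Table 1, is written `e_bar` here.

```python
PARTICLES = ["gamma", "e", "e_bar", "p"]
PROPERTIES = ["charge"]

# Matrix 1: one row per particle, one column per quantum property.
property_matrix = [
    [0],   # gamma
    [-1],  # e
    [1],   # e_bar (positron)
    [1],   # p
]

# Reactions as (inputs, outputs); illustrative subset of the early reactions.
REACTIONS = [
    (["p", "p"], ["p", "p"]),
    (["e", "e_bar"], ["gamma"]),
    (["gamma"], ["e", "e_bar"]),
]

def net_counts(reaction):
    """Signed occurrence counts (outputs minus inputs) for each particle."""
    inputs, outputs = reaction
    return [outputs.count(x) - inputs.count(x) for x in PARTICLES]

# Matrix 2: one row per reaction, one column per particle.
reaction_matrix = [net_counts(r) for r in REACTIONS]

def imbalance(row, prop_index):
    """Zero exactly when the reaction conserves the chosen property."""
    return sum(count * property_matrix[i][prop_index]
               for i, count in enumerate(row))

print([imbalance(row, 0) for row in reaction_matrix])  # [0, 0, 0]
```

With this encoding, the operator that determines quantum values edits entries of `property_matrix`, while the other operators append rows or columns.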

2.3 Heuristics for Consistency and Completeness 

Naturally, simply formulating the problem in this manner does not solve it. 
Given P particles and Q quantum properties with V values each, there are 
V^(PQ) possible assignments of values to particle-property pairs. For small values of P, 
Q, and V , one could search this space exhaustively, but recall that one must 
also consider different numbers for these parameters themselves (i.e., different 
size matrices). In general, constrained search is preferable to blind search, and 
we have incorporated a number of heuristics into the BR-4 system that focus its 
attention in useful directions. 
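
To give a sense of scale: each of the P × Q particle-property pairs can independently take any of V values, so the number of assignments is V raised to the power PQ. The figures below are illustrative choices of ours (six particles, three properties, three candidate values), not numbers reported for BR-4.

```latex
\[
  V^{PQ} \;=\; 3^{\,6 \cdot 3} \;=\; 3^{18} \;=\; 387{,}420{,}489 .
\]
```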

First, the system considers simpler theories first, starting with one that con- 
tains only directly ‘observable’ particles, quantum properties for which there 
exists separate evidence (such as electric charge), and a few observed reactions. 
Second, BR-4 alters this theory only when it encounters evidence of some 
deficiency, and then it considers only those operators that promise to repair the 
problem. Finally, the model uses constraints on the problem domain, such as 
conservation, to limit the search within the space of repairs. 

More specifically, BR-4’s approach to discovery in particle physics relies on 
the notions of consistency and completeness to constrain the reasoning process. 
For example, the operator for determining quantum values applies only when the 
system detects that an observed reaction is inconsistent with some conservation 
law. In this case, it carries out a depth-first search through the space of values, 
continuing until it encounters a value combination that violates conservation, in 
which case it backtracks. When this process is complete, the resulting quantum 
values are guaranteed to be consistent with all reactions observed so far. To 
keep the process tractable, BR-4 considers only the values 0, 1, and -1 during 
its search.¹ 
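
The depth-first search with backtracking can be sketched as follows. This is a minimal illustration under our own assumptions, not BR-4's Prolog code: the function names are hypothetical, the property is treated as additive, and a reaction is checked as soon as all of its particles have values.

```python
CANDIDATES = (0, 1, -1)  # the only values tried, as stated in the text

def consistent(assignment, reactions):
    """Check conservation for every reaction whose particles all have values."""
    for inputs, outputs in reactions:
        if all(x in assignment for x in inputs + outputs):
            lhs = sum(assignment[x] for x in inputs)
            rhs = sum(assignment[x] for x in outputs)
            if lhs != rhs:
                return False
    return True

def assign(particles, reactions, assignment=None):
    """Depth-first search; returns a consistent assignment or None."""
    assignment = dict(assignment or {})
    if not consistent(assignment, reactions):
        return None  # conservation violated: backtrack
    unassigned = [x for x in particles if x not in assignment]
    if not unassigned:
        return assignment
    particle = unassigned[0]
    for value in CANDIDATES:
        result = assign(particles, reactions, {**assignment, particle: value})
        if result is not None:
            return result
    return None

# Charge example from Section 2.1: p + p -> p + n + pi, with p fixed at 1.
reactions = [(["p", "p"], ["p", "n", "pi"])]
print(assign(["p", "n", "pi"], reactions, {"p": 1}))
# {'p': 1, 'n': 0, 'pi': 1}
```

In this run the search first tries 0 for the pion, finds the reaction unbalanced, backtracks, and then succeeds with 1.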

In some cases, the above revision process cannot eliminate the inconsistency, 
either because no combination of property values leads to conservation across all 
observed reactions or because the quantum values are determined experimentally 
(as for the spin number). This condition leads BR-4 to revise the unbalanced 
reaction by adding a ‘hidden’ particle in either the input or output, positing 
that it actually takes part in the reaction but for some reason is not directly 
observable. The system then computes the property values that would balance 
the reaction and associates them with the new particle. 
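
For an additive property, the values assigned to a hidden output particle are just the difference between the two sides of the reaction. The sketch below shows this for charge only; the helper name is hypothetical, and treating a property as additive is our simplification (the footnote notes that spin is handled differently).

```python
def hidden_particle_values(reaction, properties):
    """Values a hidden particle on the output side needs to balance the reaction."""
    inputs, outputs = reaction
    hidden = {}
    for name, values in properties.items():
        lhs = sum(values[x] for x in inputs)
        rhs = sum(values[x] for x in outputs)
        hidden[name] = lhs - rhs  # what the outputs are missing
    return hidden

properties = {"charge": {"n": 0, "p": 1, "e": -1}}
decay = (["n"], ["p", "e"])  # neutron decay as first observed
print(hidden_particle_values(decay, properties))  # {'charge': 0}
```

Placing the hidden particle on the input side instead would simply flip the sign of each computed value.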

The incompleteness constraint leads to complementary behavior. When BR-4 
finds that its current theory fails to rule out a reaction that does not occur, 
it introduces a new quantum property that is not conserved by this reaction 
but that is conserved by those it has observed. Determining the values of this 
property requires search, first for the values of particles in the missing reaction 
(constrained to satisfy an inequality), and then an embedded search for the values 
of other particles (constrained to satisfy equalities corresponding to observed 
reactions). As before, if the system arrives at a partial combination of values 
that rules out an observed reaction or fails to eliminate the unobserved one, 
it considers alternative paths until it finds an acceptable set. In both searches, 
BR-4 considers smaller absolute values before turning to larger ones. 
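
The search for a new property can be sketched as below: unobserved reactions act as inequalities and observed reactions as equalities. This is our own simplified rendering, with hypothetical function names, a single flat depth-first search instead of the nested one described above, and one symbol `e` standing for the electron throughout.

```python
VALUES = (0, 1, -1)  # smaller absolute values are tried first

def fully_assigned(reaction, a):
    inputs, outputs = reaction
    return all(x in a for x in inputs + outputs)

def balanced(reaction, a):
    inputs, outputs = reaction
    return sum(a[x] for x in inputs) == sum(a[x] for x in outputs)

def acceptable(a, observed, unobserved):
    """Reject partial assignments that break an equality or an inequality."""
    for r in observed:
        if fully_assigned(r, a) and not balanced(r, a):
            return False  # would rule out an observed reaction
    for r in unobserved:
        if fully_assigned(r, a) and balanced(r, a):
            return False  # would fail to eliminate an unobserved reaction
    return True

def search(order, observed, unobserved, a=None):
    """Assign particles in the given order, backtracking on failure."""
    a = a or {}
    if not acceptable(a, observed, unobserved):
        return None
    if len(a) == len(order):
        return a
    particle = order[len(a)]
    for v in VALUES:
        result = search(order, observed, unobserved, {**a, particle: v})
        if result is not None:
            return result
    return None

observed = [
    (["p", "p"], ["p", "p"]),
    (["e", "e"], ["gamma"]),
    (["n"], ["p", "e", "nu"]),
]
unobserved = [(["p"], ["e", "gamma"])]  # proton decay

# Particles of the missing reaction come first in the ordering.
print(search(["p", "e", "gamma", "nu", "n"], observed, unobserved))
# {'p': 1, 'e': 0, 'gamma': 0, 'nu': 0, 'n': 1}
```

The ordering matters in this sketch: with `["p", "e", "gamma", "n", "nu"]` the same code returns n = 0 and nu = -1, which is also consistent with these few reactions.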

We can extend the notion of incompleteness to include theories that do not 
explicitly specify all reactions that follow from them, as occurs when one postu- 
lates a new particle. In this situation, BR-4 systematically generates all possible 
reactions of the new particle involving one, two, or three other known particles. 
Some of these reactions take the form of decays, whereas others involve collisions 
among particles. For each such tentative reaction R, the system predicts that 
R will occur if it conserves all known properties, and predicts that the reaction 
will not occur otherwise. 
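
The prediction step can be illustrated with a small generator. The enumeration scheme below is an assumption of ours (decays into one to three known particles, and collisions with one known particle yielding one or two others); it uses charge as the only known property and does not filter out trivial or physically implausible candidates.

```python
from itertools import combinations_with_replacement

def conserves_all(inputs, outputs, properties):
    """True if every known property has equal totals on both sides."""
    return all(
        sum(v[x] for x in inputs) == sum(v[x] for x in outputs)
        for v in properties.values()
    )

def candidate_reactions(new, known):
    """Reactions of `new` involving one, two, or three other known particles."""
    for k in (1, 2, 3):  # decays: new -> k known particles
        for outs in combinations_with_replacement(known, k):
            yield (new,), outs
    for a in known:      # collisions: new + a -> one or two known particles
        for k in (1, 2):
            for outs in combinations_with_replacement(known, k):
                yield (new, a), outs

def predict(new, known, properties):
    """Keep only the candidates that conserve all known properties."""
    return [(i, o) for i, o in candidate_reactions(new, known)
            if conserves_all(i, o, properties)]

known = ["p", "n", "e"]
properties = {"charge": {"p": 1, "n": 0, "e": -1, "nu": 0}}
predictions = predict("nu", known, properties)
print((("nu", "n"), ("p", "e")) in predictions)  # True: charge is 0 on both sides
```

Adding further properties to `properties` shrinks the list of predictions, which mirrors how each newly posited quantum number rules out more candidate reactions.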



¹ Physicists assign to the spin property not only integers like 0 and 1, but also values 
like 1/2 and 3/2. BR-4 also considers these values for this property and, like physicists, 
calculates the spin number using group theory. 
Table 1. The quantum values for elementary particles known (a) in 1930, prior to 
experimental detection of the neutron, and (b) after postulation of the neutrino. 

        Particle     mass      charge   spin 
  (a)   γ              0.00      0      1 
        e              0.51     -1      1/2 
        p            938.26      1      1/2 
        ē              0.51      1      1/2 
  (b)   n            939.55      0      1/2 
        ν              0.00      0      1/2 

3 Illustrative Examples from Particle Physics 

In this section we consider four examples of discovery from the history of par- 
ticle physics, involving the neutrino, baryon and lepton numbers, and electron 
and muon numbers. In each case, we recount the main historical events, and 
then examine BR-4’s behavior when presented with similar observations. Our 
historical treatment is based upon a number of sources on particle physics, in- 
cluding Griffiths (1987), Ne’eman and Kirsh (1986), Omnes (1970), Pais (1986), 
and Trefil (1980). 

3.1 Discovery of the Neutrino 

Until the early 1930’s, scientists had observed only a few elementary particles, 
shown in Table 1 (a) along with their mass and their values on three conserved 
quantum properties: energy, charge, and spin. The known reactions were also 
limited to a small set: p + p → p + p, e + e → γ, and γ → e + e. This situation 
changed after Chadwick’s experimental detection of the neutron in 1932, which 
also clarified another outstanding issue (Giancoli, 1995). 

Much earlier, physicists had observed beta decay, a process in which an ele- 
ment emits an electron and is transformed into another element with a higher 
atomic number. This transformation appeared to violate conservation of both 
energy and spin, leading Bohr to suggest that these properties are truly not 
conserved within the nucleus. However, in 1930, Pauli proposed another explanation: 
that beta decay also emitted another particle that was difficult to detect. 

Chadwick’s experiments also revealed neutron decay, n → p + e, which 
occurs in about 800 seconds for free neutrons. Like beta decay, this reaction 
appeared to violate energy and spin conservation, but in simplified form. Again, 
Pauli’s account avoided this problem by postulating a new particle, also gener- 
ated during the decay reaction, that would balance out the missing energy and 
spin. In 1934, Fermi formalized this proposal for the neutrino, which he posited 
as having zero rest mass, no electrical charge, and a spin of one half. 
Table 2. Particle reactions that were (a) observed and (b) not observed in experiments 
after the introduction of the particles in Table 1 (b). 

  (a) Observed reactions        (b) Unobserved reactions 
  p + p → p + p                 p → e + γ 
  e + e → γ                     p → e + e + e 
  γ → e + e                     p → e + γ + γ 
  γ + p → e + e + p 
  n → p + e + ν 

Given the four reactions above and the quantum numbers in Table 1 (a), BR-4 
responds in a similar manner. The system immediately detects an inconsistency 
concerning the spin values for neutron decay and attempts to correct it. (The 
current program does not address the issue of energy conservation.) BR-4 cannot 
modify the spin counts of the particles in the reaction, as these values are marked 
as having been established by observation. This leaves revision of the unbalanced 
reaction as the only solution. 

One such revision adds an extra particle to the output side of the reaction, 
giving n → p + e + ν. Using the conservation laws as constraints, the system 
computes the mass, charge, and spin of the new particle, ν, as 0.0, 0, and 1/2, 
respectively. Another possible revision would have added a new particle with the 
opposite spin to the input side of the reaction. However, we believe physicists 
favored the former solution because they were thinking in terms of a decay 
process, so we have biased BR-4 in this direction as well. 

Our treatment of this episode ignores many details, including the role that 
conservation of energy, in addition to spin, played in driving proposals for the 
neutrino. But the general line of reasoning, that a new particle with certain 
quantum values was needed to preserve conservation, appears historically accu- 
rate, and BR-4’s heuristics arrive at the same description for this particle as did 
Fermi and his colleagues. 

3.2 Proposing the Baryon Number 

The inference of the neutrino left physicists with six elementary particles, having 
the properties and values shown in Table 1 (a) and (b). Scientists realized that 
the existence of these particles, combined with the existing conservation laws, 
implied a variety of reactions. Subsequent experiments revealed evidence for 
some of these reactions, shown in Table 2 (a), but not for some others, shown in 
Table 2 (b). For some reason, the three predicted decays of the proton did not 
occur in nature; to remedy this problem, physicists proposed a new quantum 
property, known now as the baryon number. 

Given the six particles in Table 1, our model follows a similar line of reasoning. 
BR-4 realizes that its current theory is incomplete, so it predicts all decay 
and collision reactions involving these entities (up to length three) that conserve 
charge and spin, giving the seven reactions² in Table 2. These correspond to 
proposed experiments with the particles, or at least to suggestions for what to 
look for in such experiments. When informed that the reactions in Table 2 (a) 
occur but those in (b) do not, BR-4 infers that its theory is incomplete in a 
deeper sense and proposes a new property to correct the situation. 

Table 3. The quantum values of particles known in 1953, after discovery of baryon 
and lepton numbers. 

  Particle     mass      charge   spin   baryon   lepton 
  γ              0.00      0      1      0        0 
  e              0.51     -1      1/2    0        1 
  p            938.26      1      1/2    1        0 
  n            939.55      0      1/2    1        0 
  ē              0.51      1      1/2    0       -1 
  ν              0.00      0      1/2    0        1 
  μ            105.60     -1      1/2    0       -1 
  μ̄            105.60      1      1/2    0        1 
  π            139.60      1      0      0        0 
  π̄            139.60     -1      0      0        0 
  π0           135.00      0      0      0        0 

To determine the values of this new property, BR-4 selects one of the missing 
reactions, say p → e + γ, and turns it into a set of inequalities, each based 
on a different combination of values for the particles involved. In this case, it 
generates the four inequalities 1 ≠ 0 + 0, 1 ≠ 1 + 1, 0 ≠ 1 + 0, and 0 ≠ 0 + 1. The 
system then selects one of these value sets, say the first, {p = 1, e = 0, γ = 0}, 
and inserts them into one of the observed reactions, say n → p + e + ν, this time 
treating it as an equality. 

In this case, BR-4 obtains the expression n = 1 + 0 + ν, which leaves the 
property values for n and ν unspecified. Two consistent value sets are possible for 
this pair, {n = 1, ν = 0} and {n = 0, ν = -1}. BR-4 selects the first and uses it 
to check the observed reactions, introducing values for the remaining unassigned 
particles as necessary. Detection of an unbalanced reaction that violates 
conservation of the new property causes backtracking to one of the alternative value 
sets. If the search exhausts all such sets produced from the observed reactions, 
the system backtracks further and considers other value sets generated from the 
unobserved reactions. 

² BR-4 also generates two other reactions, besides n → p + e + ν, that involve neutrinos: 
ν + p → n + e and ν + n → p + e. However, physicists showed little concern when they 
did not immediately detect these reactions, presumably because theory predicted 
that neutrinos interacted very rarely. Thus, we told the system to ignore them at 
this stage of our simulation. 
Table 4. Some particle reactions that were (a) observed and (b) not observed in 
experiments after the discovery of mesons. 

  (a) Observed reactions        (b) Unobserved reactions 
  μ → e + ν + ν                 μ → e + γ 
  π → e + ν                     μ → e + e + e 
  π → μ + ν                     π0 → e + μ 
  π → π0 + e + ν                π0 → μ + e 
  π0 → e + e 
  π0 → ν + ν 
  π0 → γ + γ 

Given the experimental results in Table 2, BR-4 arrives at the value zero for 
all particles except the proton and neutron, to which it assigns the value one, 
as shown in the first six rows of Table 3. These settings correspond to those 
obtained by physicists for the baryon number, which successfully explain the 
absence of the reactions in Table 2 (b), since they fail to conserve this property. 
As new particles become known, BR-4 assigns baryon values to them as well, 
using the same search mechanism. 
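
As a quick check of this result, the sketch below applies the baryon values from Table 3 to the reactions of Table 2. The encoding is ours and purely illustrative; since the electron and its antiparticle both carry baryon number zero, a single key `e` stands for either.

```python
BARYON = {"gamma": 0, "e": 0, "p": 1, "n": 1, "nu": 0}

OBSERVED = [
    (["p", "p"], ["p", "p"]),
    (["e", "e"], ["gamma"]),
    (["gamma"], ["e", "e"]),
    (["gamma", "p"], ["e", "e", "p"]),
    (["n"], ["p", "e", "nu"]),
]
UNOBSERVED = [
    (["p"], ["e", "gamma"]),
    (["p"], ["e", "e", "e"]),
    (["p"], ["e", "gamma", "gamma"]),
]

def conserved(reaction, values=BARYON):
    """True if the baryon totals on both sides of the reaction agree."""
    inputs, outputs = reaction
    return sum(values[x] for x in inputs) == sum(values[x] for x in outputs)

print(all(conserved(r) for r in OBSERVED))    # True: consistent with the data
print(any(conserved(r) for r in UNOBSERVED))  # False: every proton decay is ruled out
```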

3.3 Mesons and the Lepton Number 

In 1935, Yukawa proposed the existence of additional particles in the nucleus, 
with a mass of about 100 MeV. The reasoning behind Yukawa’s proposal, which 
we have not attempted to model, involved energy calculations on atomic nuclei. 
Later, in the 1940s, observations of cosmic rays revealed five such particles: 
the muon (μ) and anti-muon (μ̄), the pion (π) and anti-pion (π̄), and the pion 
zero (π0). These suggested a variety of reactions, some that were observed by 
scientists and others that were not. 

Konopinski and Mahmoud (1953) attempted to explain the mismatch between 
theory and data, focusing on the five detected reactions μ → e + ν + ν, 
μ + ν → e + ν, p + μ → n + ν, ν + n → p + μ, and ν + n → p + e, and 
on the single unobserved reaction μ → e + γ. In order to explain the absence 
of this decay, they proposed a new quantum property, the lepton number, with 
nonzero values for the muon, the electron, the neutrino, and their antiparticles.³ 
However, Konopinski and Mahmoud assumed that the muon in the reactions 
was an antiparticle, which led them to assign it the lepton value —1. With the 
introduction of the lepton number, physicists had produced a theory, equivalent 
to that depicted in Table 3, that appeared consistent and complete. Many scien- 
tists had reservations about Konopinski and Mahmoud’s theory, but it was the 
best account available at the time. 

³ Pais (1986) claims that he suggested the lepton number, including its name, earlier, 
in 1947, based on an analogy with the baryon number for heavier particles. 
Table 5. Particle reactions that were (a) observed and (b) not observed in experiments 
after distinguishing between electron neutrinos (ν_e) and muon neutrinos (ν_μ). 

  (a) Observed reactions        (b) Unobserved reactions 
  μ → e + ν_e + ν_μ 
  μ̄ → e + ν_e + ν_μ             ν_μ + p → n + e 
  π →                           ν_μ + n → p + e 
  π → 
  π → μ + ν_μ 
  π → π0 + e + ν_e 
  π0 → e + e 
  π0 → ν + 
  π0 → γ + γ 

BR-4 responds to the introduction of mesons in a similar manner. Given the 
five new particles, it predicts a variety of reactions, including four muon decays, 
five pion decays, and ten reactions that involve the pion-zero. Table 4 shows a 
sample of these predictions, some (a) that were observed and others (b) that 
were not. These differ somewhat from the ones addressed by Konopinski and 
Mahmoud, who presumably did not mention the observed decays that had been 
known since 1947 (Griffiths, 1987, p. 19, p. 25) and may have ignored some 
unobserved ones because the values for the lepton number forbid them. 

Upon finding that the predicted reaction μ → e + γ has not been observed, 
BR-4 attempts to introduce a new property with values that rule out this 
interaction. However, the system cannot find a consistent set of values for this 
property if, as usual, it considers only zero and positive values. For BR-4 to 
follow Konopinski and Mahmoud’s reasoning, we must tell it (as the physicists 
concluded) that μ is an antiparticle, which lets the system consider negative 
quantum values. Table 3 shows the values generated by the system when given 
this assistance; they correspond to those inferred by Konopinski and Mahmoud, 
with the exception that μ and μ̄ are reversed. 



3.4 Electron and Muon Numbers 

In the year 1953, another important development took place. Additional 
experiments revealed indirect evidence for the predicted reaction ν + p → n + e, which 
obeyed all known conservation laws and thus was required for the theory to be 
complete. Yet this reaction occurred when the neutrino (ν) had been generated 
through beta decay (n → p + e + ν), but not when produced through muon decay 
(μ → e + ν + ν). 

To resolve this dilemma, scientists postulated that the two reactions actually 
generated two distinct types of neutrinos, calling the former an electron neutrino 
(ν_e) and the latter a muon neutrino (ν_μ). This distinction (and the analogous 
one for anti-neutrinos) introduced two additional rows in the table of particles. 
However, it also produced the unobserved reactions shown in Table 5 (b), which 
physicists sought to explain by introducing yet another property, which they 
named the electron number. 

Our model cannot directly explain the historical distinction into two classes of 
neutrinos, but we believe it constitutes a variation on the heuristic for postulating 
new particles that originally led to inference of the neutrino. The situation also 
bears some similarity to the distinction inferred by Mendel in 1865 to explain the 
different offspring of apparently identical peas, which Shen and Simon (1989) 
have modeled using a related mechanism. Langley et al. (1987) have used a 
similar technique to explain distinctions that occurred in the history of chemistry. 

Once this difference has been introduced manually, BR-4 realizes that its 
current theory is incomplete, in that it cannot explain the unobserved reactions. 
Postulating a new property, it searches the space of values using the same process 
as it used for the baryon and lepton numbers. The resulting values agree with 
those proposed by physicists for the electron number, and they are sufficient to 
rule out the two unobserved muon reactions shown in Table 5 (b). Physicists also 
postulated yet another quantum property, called the muon number, on grounds 
of symmetry between electrons and muons. However, lacking any heuristics of 
this sort, BR-4 cannot reproduce this step in the human scientists’ reasoning. 

3.5 BR-4 as a Historical Model 

We have implemented BR-4’s operators and heuristics in Prolog, and we have 
verified the system’s ability to reproduce the historical discoveries reported ear- 
lier. In each case, we gave the system a set of particles, a set of known quantum 
properties, the hypothesized values for those properties, and a set of observed and 
unobserved reactions; in response, BR-4 generated the revised values, new par- 
ticles and properties, and predicted reactions we have described. These formed 
a partial basis for the next inputs to the system, giving historical continuity to 
the model’s behavior. 

The resulting chain of reasoning carries BR-4 through more than two decades 
of major discoveries in particle physics. Moreover, the system relies on mecha- 
nisms that are consistent with our knowledge about the nature of human cog- 
nition. In particular, it carries out a limited heuristic search through a space of 
models that is guided both by knowledge about the domain and by observations. 
Moreover, this process occurs in an incremental fashion, with the system revis- 
ing previous models as new phenomena become available and with new results 
becoming background knowledge for the next round of discovery. 

As we have noted, BR-4 does not explain all of the major events in particle 
physics, even during the period we have attempted to simulate. In a number of 
cases, we had to intervene manually at selected points beyond the insertion of 
information about the outcomes of predictions. These steps included telling the 
system to ignore some unobserved reactions involving neutrinos and to assume that 
the muon is an antiparticle with nonpositive quantum numbers, and introducing 
the distinction between electron and muon neutrinos. Also, the system explains 
the historical sequence of events at a quite abstract level that ignores many 
details which occupied particle physicists’ time and energy. 

Thus, although BR-4 has let us model an extended period in the history of 
science, it remains an incomplete account. Each situation that required interven- 
tion suggests the need for additional mechanisms that should let its successor 
better match the historical record. These should include heuristics for ignoring 
predictions that are too difficult to observe, for considering wider ranges of quan- 
tum values, and for discriminating particles that appear the same but behave 
differently. Each such extension seems as general, at least in principle, as the 
existing operators and heuristics on which BR-4 relies. 



4 Related Work on Computational Scientific Discovery 

Our computational model of discovery draws many of its ideas from earlier work 
in this area. BR-4 is a direct descendant of Zytkow and Simon’s (1986) Stahl, 
which modeled a variety of qualitative discoveries in the history of chemistry. 
The detection of inconsistencies in reactions played an important role in this 
system, with one of its responses being the introduction of new elements like 
phlogiston, which served much the same role in early chemistry as the neutrino 
did in particle physics. 

Rose and Langley (1986) described STAHLp, a rational reconstruction of the 
earlier system that showed all of its discoveries could be explained in terms of in- 
consistencies and their resolution. In addition, they used STAHLp and Revolver 
(Rose & Langley, 1988), a similar system, to model a number of other reaction- 
oriented discoveries from the history of science, including some from particle 
physics. Moreover, their approach showed that dependency-directed reasoning 
simplified the theory-revision process, letting their systems handle problems with 
a search-control scheme that relied on incremental hill climbing rather than more 
systematic search. 

Kocabas’ (1991) BR-3 system extended this framework to include the detec- 
tion of incomplete theories and the postulation of new properties to explain the 
absence of reactions. He applied this idea to the history of particle physics, using 
it to explain the origin of several quantum numbers and the particular values 
assigned to them. In related work, Kocabas (1992) adapted similar methods to 
discovery in the area of superconductivity. BR-3 was the immediate precursor of 
BR-4, with the former differing mainly in that it lacked the ability to postulate 
new particles and to predict new reactions. 

Valdes-Perez (1994) has described an alternative approach to discovery in 
particle physics, which he implemented in his Pauli system. This scheme uses a 
variation on linear programming to search the space of property values, subject to 
constraints that reflect observed and unobserved reactions. In addition, Fischer 
and Zytkow (1992) have reported on Gell-Mann, a system designed to explain 
the formation of the quark theory, which also carries out a form of constraint- 
satisfaction search to determine parameter values. Both systems have generated 
interesting models that differ from those found by human scientists, but these 
results, combined with their more powerful and nonincremental search methods,
make them less plausible as historical accounts than the Stahl, STAHLp, BR-3, 
and BR-4 systems. 

Despite their differences, each of these systems fits nicely within the frame- 
work proposed by Valdes-Perez, Simon, and Zytkow (1993), which characterizes 
the discovery process in terms of operations on two related matrices. The various 
programs differ in their operators for altering the matrices, with BR-3 and BR-4 
adding steps for introducing a property, predicting reactions, and positing a par- 
ticle. Pauli and Gell-Mann also explore a matrix space but invoke different 
search regimens for selecting operators. 

Other research on scientific theory revision, such as Rajamoney’s (1990) work 
on theory-guided experiment generation in physics, seems less closely related. 
However, Kulkarni and Simon’s (1990) Kekada integrates theory revision, ex- 
periment design, and problem formulation to model Krebs’ discovery of the urea 
cycle. The system includes heuristics for making predictions, redirecting atten- 
tion when they are violated, and designing experiments to determine the un- 
derlying cause. The Kekada work comes the closest to our own in spirit, in 
that both involve modeling an extended period in the history of science, rather 
than isolated events. However, Kulkarni and Simon’s model operates at a finer 
granularity and better matches the historical details than does BR-4. 

5 Directions for Future Research 

Although BR-4 provides an abstract account for some important developments in 
particle physics, there remains considerable room for improvement. One problem 
is that the model’s coverage of the historical process remains far from continu- 
ous. A more complete account would incorporate knowledge about the difficulty 
of detecting some reactions to explain why scientists chose to ignore some unob- 
served interactions (e.g., those involving neutrinos) while focusing their attention 
on others (e.g., those concerning proton decay). We should further reduce reliance
on human intervention by adding an operator like the one described by Shen and 
Simon (1989) that introduces a distinction between particles (e.g., electron and 
muon neutrinos) based on behavioral differences observed over time. Heuris- 
tics for proposing new particles and quantum properties on theoretical grounds 
would further strengthen the model. 

We also hope to extend the system to introduce componential models, which 
describe particles at one level as combinations of more primitive ones. Langley 
et al.’s (1987) Dalton took some steps along these lines to explain relations 
between chemical molecules and elements, but we can incorporate similar meth- 
ods into BR-4 to explain the origins of the quark theory and its alternatives. 
The basic task involves explaining why elementary particles with some quan- 
tum properties exist while others do not. BR-4’s constraints of consistency and 
completeness seem well suited for this problem, which involves postulating new 
component particles (quarks), then searching the space of quantum values and 
their compositions that satisfy certain constraints (such as symmetry) for known 
particles and that violate these constraints for nonexistent ones. 

Finally, although BR-4 implicitly models social aspects of the discovery process
by addressing extended periods to which multiple scientists contributed, it
accomplishes this at a very abstract level. A more detailed account of social in- 
teractions would include explicit communication among particle physicists, with 
theorists passing on predictions to experimentalists, who in turn report their 
observations to theorists. An extended model would also support competition 
in the development of theories to explain new findings and in finding evidence 
for predicted events. The history of particle physics is rich in examples of such 
interactions, and we believe that appropriate revisions to BR-4 would let us 
model at least some of them. To this end, we should assign different facets of 
the system’s domain knowledge to different agents, which would communicate 
through a common representation; in addition, separate agents would explore 
different branches when the search process suggests alternative solutions. 

6 Concluding Remarks 

In this paper we presented BR-4, an integrated model of historical scientific 
discovery. We examined the system’s behavior on four major problems that arose 
in particle physics, showing that it can replicate important steps in the historical 
development of this field, some of which were considered major discoveries when 
first introduced. In particular, BR-4 proposes the existence of the neutrino to 
avoid violating conservation of spin, it introduces baryon and lepton numbers 
to explain the absence of reactions involving proton decay, and it postulates 
electron numbers to rule out unobserved neutrino reactions. The system also 
finds appropriate quantum values for each particle and predicts the reactions 
implied by a set of particles and properties. 
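
The way additive quantum properties rule out unobserved reactions can be illustrated with a minimal Python sketch. This is not BR-4's implementation: the particle labels, the table layout, and the function names `conserves` and `allowed` are assumptions made for illustration. Only the standard textbook values of baryon and lepton number are used.

```python
# Hypothetical illustration of a conservation check; not BR-4's data structures.
# Standard baryon and lepton numbers for a handful of particles.
BARYON = {"p": 1, "n": 1, "e-": 0, "e+": 0, "anti_nu_e": 0, "pi0": 0}
LEPTON = {"p": 0, "n": 0, "e-": 1, "e+": -1, "anti_nu_e": -1, "pi0": 0}


def conserves(prop, inputs, outputs):
    """True if the additive property has the same total on both sides."""
    return sum(prop[x] for x in inputs) == sum(prop[x] for x in outputs)


def allowed(inputs, outputs, properties):
    """A reaction is permitted only if every property is conserved."""
    return all(conserves(p, inputs, outputs) for p in properties)


beta_decay = (["n"], ["p", "e-", "anti_nu_e"])   # observed reaction
proton_decay = (["p"], ["e+", "pi0"])            # unobserved reaction

print(allowed(*beta_decay, [BARYON, LEPTON]))
print(allowed(*proton_decay, [BARYON, LEPTON]))
```

Under these assumed tables, neutron beta decay passes both checks, whereas proton decay into a positron and a neutral pion fails them, which is the kind of absence that baryon and lepton numbers were introduced to explain.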

The BR-4 model achieves these results using simple processes that appear 
to have considerable generality. The system employs four basic operators for 
determining the values of a quantum property, creating new properties, posit- 
ing new particles, and predicting reactions among known particles. Moreover, it 
uses consistency and completeness constraints to selectively apply these opera- 
tors, and it incorporates a depth-first control scheme to carry out search when 
necessary. These activities operate in a continual loop, with incorrect predictions 
leading to revised models, which then become the starting point for new discov- 
eries. Together, they let BR-4 explain, with occasional aid from its developers, 
an extended period in the history of particle physics. The simplicity and gener- 
ality of these mechanisms suggest that we can explain other aspects of scientific 
discovery in similar terms, and we hope to test that hypothesis in future work. 

Abstract. Innovation is critical for maintaining competitive advantage
in a high tech global economy, especially for organizations or nations that 
do not possess low cost labor forces. Many studies on innovation attempt 
to identify endogenous and exogenous variables that impact innovation 
[7], in order to better understand the environment that promotes inno- 
vation. The author’s recent efforts have focused on developing processes 
for enhancing innovation that exploit the transference of information and 
insights among seemingly disparate disciplines. 

The objective of this paper is to describe and demonstrate a hybrid 
tandem literature-workshop approach to innovation that eliminates the 
weaknesses but retains the strengths of each component. The literature- 
based component identifies the technical disciplines related to the central 
technical theme of interest, the experts in these disciplines, and promising 
candidate concepts for innovative solutions. These outputs define the 
agenda and participants for the workshop-based component. An example 
of this combined approach is presented for the theme of Autonomous 
Flying Systems. The hybrid approach appears to be an excellent vehicle 
for enabling innovation. However, it requires substantial time and effort 
in both phases. 



1 Introduction 

The process of innovation is of immense social interest and impact, has been 
studied extensively, and yet remains poorly understood. A critical factor in many 
instances of innovation is the transfer of information and understanding devel- 
oped in one or more disciplines to other, perhaps very disparate, disciplines. With 
the explosion in availability of information, scientists and technologists find it 
increasingly difficult to remain aware of advances within their own discipline(s), 
much less in other seemingly unrelated ones. As science and technology become 
more specialized, the incentives for interdisciplinary research and development 
are reduced, and this cross-discipline transfer of information becomes more dif- 
ficult. The author’s observation, from examination of many science and tech- 
nology (S&T) sponsoring agencies and performing organizations and technical 
journals, is that strong cross-disciplinary disincentives exist at all phases of
program/project evolution, including selection, management and execution, review,
and publication. To overcome cross-discipline transmission barriers, and thereby
enhance innovation, systematic methods are required to heighten awareness of
experts in one discipline to advances in other disciplines. Most desirable are
methods that incorporate/require cross-disciplinary access as an organic component.

* The views expressed in this paper are those of the author and do not represent the
views of the Department of the Navy.

This paper presents two different, yet complementary, approaches to increase 
cross-discipline knowledge transfer and provide the framework for enhancing 
innovation. One is literature-based, the other is workshop-based. Each approach 
individually represents a major advance in enabling innovation and discovery, 
and the hybrid of the two approaches provides a synergy that multiplies their 
combined benefits. 

The literature-based approach is summarized first, followed by the workshop- 
based approach. The advantages of combining the two approaches are then pre- 
sented. The details of each approach are presented in the appendices. 



1.1 Accessing Linked Literatures for Enhancing Innovation- 
Summary 

The first approach searches for relationships between linked, overlapping litera- 
tures, and discovers relationships or promising opportunities not obtainable from 
reading each literature separately. The general theory behind this approach, ap- 
plied to two separate literatures, is based upon the following considerations [18]. 

Assume that two literatures with disjoint components can be generated, the 
first literature AB having a central theme a and sub-themes b, and the second 
literature BC having a central theme(s) b and sub-themes c. From these combinations,
linkages can be generated through the b themes that connect both literatures
(e.g., AB -> BC). Those linkages that connect the disjoint components
of the two literatures (e.g., the components of AB and BC whose intersection 
is zero) are candidates for discovery, since the disjoint themes c identified in 
literature BC could not have been obtained from reading literature AB alone. 
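
The linkage idea can be made concrete with a small Python sketch. It assumes, purely for illustration, that each literature has already been reduced to a set of theme pairs; the function name `linked_candidates` and this pair representation are hypothetical and are not part of the published technique.

```python
def linked_candidates(ab_pairs, bc_pairs):
    """Return (a, b, c) chains whose c theme never appears in literature AB.

    ab_pairs: set of (a, b) theme pairs drawn from literature AB (assumed input)
    bc_pairs: set of (b, c) theme pairs drawn from literature BC (assumed input)
    """
    # Every theme that a reader of literature AB alone would encounter.
    themes_in_ab = {theme for pair in ab_pairs for theme in pair}
    chains = [
        (a, b, c)
        for (a, b) in ab_pairs
        for (b2, c) in bc_pairs
        if b == b2 and c not in themes_in_ab
    ]
    return sorted(chains)


ab = {("eicosapentaenoic acid", "blood viscosity")}
bc = {("blood viscosity", "Raynaud's disease"),
      ("blood flow", "Raynaud's disease")}

print(linked_candidates(ab, bc))
```

Only the chain that passes through the shared theme (blood viscosity) is returned; the blood flow pair has no counterpart in AB and therefore yields no linkage.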

Some initial applications of the first approach have been published in the 
medical literature [18]. One interesting discovery was that dietary eicosapen- 
taenoic acid (theme a from literature AB) can decrease blood viscosity (theme 
b from both literatures AB and literatures BC) and alleviate symptoms of Ray- 
naud’s disease (theme c from literature BC). There was no mention of eicosapen- 
taenoic acid in the Raynaud’s disease literature, but the acid was linked to the 
disease through the blood viscosity themes in both literatures. Subsequent medical
experiments confirmed the validity of this literature-based discovery [2]. (A
web site [17] overviews the process used to generate this discovery, and contains
software that allows the user to experiment with the technique. A 1998 article
in The Scientist outlines perceptions of different knowledgeable individuals on
Swanson and Smalheiser’s general technique [1].)

This literature-based discovery approach is in its infancy. Public and private 
financial support for this technology is minimal. It is an area that seems to
have fallen through the cracks. There is essentially one group that is publishing 
results of literature-based innovation and discovery in the credible peer-reviewed
literature [18,20,19,15,16,17], and two groups that have published concept pa- 
pers [3,10]. Presently, the approach is not automatic. It requires much thought, 
expertise, and effort. The author’s group is examining different approaches to 
make the process more systematic, while reducing the manual labor intensity. 
Given the potential benefits of the literature-based approach for stimulating 
innovation, it is truly a technology whose time has come. 

Appendix A generalizes and expands upon the literature-based approach, 
using the Database Tomography (DT) techniques and experience developed by 
the author since 1991 [4,5,8,10,14,13]. It outlines the theory of the expanded 
approach, the implementation details, and overviews the range of applications 
possible with this technique. 

1.2 Interdisciplinary Workshops for Enhancing Innovation — 
Summary 

The second approach consists of convening workshop(s) of experts from different 
disciplines focused on specific central themes. The purpose of such a workshop 
is to achieve multi-discipline synergies and cross-discipline transfers to gener- 
ate promising research directions for these central themes. The theory behind 
this approach is described in Appendix B. To test this theory, a workshop on 
Autonomous Flying Systems was convened in December 1997, and the imple- 
mentation mechanics and results are described in detail in Appendix B. 

The total workshop process consisted of three phases: 

(1) a two month pre-meeting e-mail phase in which each participant provided 
descriptions of advanced capabilities and promising research opportunities 
from his/her discipline to all other participants; 

(2) a two-day meeting at the Office of Naval Research (ONR) during which the 
promising opportunities identified beforehand were discussed, crystallized, 
and enhanced; and 

(3) a post meeting e-mail phase in which each participant provided additional 
or embellished opportunities. 

A number of important lessons were extracted from the conduct of this work- 
shop, and they can be summarized as follows: 

(a) The workshop approach broke new ground toward stimulating innovative 
thought. It was not easy, simple, or effortless, and required substantial plan- 
ning and work in order to be effective. One should not throw people from 
15 different disciplines together in a room for two days and hope to get 
new ideas synthesized. There needs to be a common generic thread woven 
through the different disciplines represented to spark the innovative thought 
process. 

Interdisciplinary workshops, when performed correctly, are the wave of the 
future in defining new research (and technology) areas and approaches. Be- 
cause of the intensity and effort involved throughout the process, they are 
most appropriate for large scale “grand challenges” in full-blown workshop
form, but appropriate as well for smaller scale issues. 

Representatives from diverse technical disciplines, organizations, and devel- 
opment categories attended the workshop. There was substantial value in 
having a balance of discipline, category, and organization diversity at the 
same meeting. The different perspectives presented benefited all participants. 
The use of modern information technology can expand the degree of diver- 
sity dramatically. Some of the concepts and group software proposed for 
network-centric peer review [12] can be easily adapted for use in innovation 
workshops. This would allow many more people, disciplines, and organiza- 
tions to be represented, further enhancing the potential for cross-discipline 
information transfer and resultant innovation and discovery. 

Problem selection is crucial. The problem should be sufficiently general that 
many diverse disciplines can link to it. Given the choice of equally relevant 
problems, there is more potential for impact in selecting problem areas for 
which a large interdisciplinary community is not yet obvious. 

It is important to select participants by the most objective processes avail- 
able. A combination of expert recommendation and strategic topical maps 
based on computational linguistics, publications, and citations was used for 
the selection process, and this approach produced highly knowledgeable in- 
dividuals. Incorporation of the full literature-based approach to innovation 
in the discipline or participant selection process could further enhance con- 
fidence that the most appropriate mix of disciplines and experts has been 
chosen. 

It is extremely important that individuals selected for participation be world-class
experts in their particular areas. There are relatively few individuals
producing the seminal works in any field [8,9], and it is these people
who should be central to any truly innovative workshops. However, in addi- 
tion to these established experts, highly competent individuals new to the 
field should also be selected. One benefit of transcending selection of known 
experts is that fresh faces who are new to established communities appear. 
They can sometimes challenge established paradigms and offer concepts typ- 
ically not advanced through panels based solely upon well-known, over-used 
panelists. 

The e-mail component of the workshop is crucial. The gestation period be- 
tween the input of promising ideas and their actual discussion at the work- 
shop allows consideration of many different approaches and syntheses. It 
also saves substantial time at the workshop by clarifying confusing issues 
beforehand. However, in the first experience reported here, the stimulation 
of dialogue in the e-mail phase among most of the participants did not occur. 
The only participant to raise questions was the author, and this occurred 
only a few times. Nonetheless, in these instances, the dialogue was extremely 
valuable in clarifying issues and surfacing points of contention. In future 
workshops, it is strongly recommended that a few individuals representing 
different disciplines be asked to assume a role of facilitator, with the task 
of stimulating dialogue and raising questions during the workshop build-up
phase. 

(g) All the attendees at the workshop were required to participate; there were 
no pure observers. This meant that they had to submit accomplishments and 
opportunities statements by e-mail. They also had to be prepared to lead 
discussions at the workshop. This participation requirement was valuable in 
that each attendee obtained a sense of ownership in the workshop and its 
outcome. His/her contribution tended to be more substantive and creative 
than is typically the case at standard workshops. Those who contributed 
more in the e-mail phase tended to contribute more in the workshop phase. 
In addition, there was a sense of equality among participants when all were 
required to contribute, as opposed to an audience/performer environment 
with passive onlookers. On the other hand, the downside of requiring atten- 
dees to be active participants was that attendance had to be limited. This 
may not be totally bad, since audience participation can substantially enrich 
workshop discussion. 

(h) In general, there needs to be some incentive to motivate participation of 
world-class experts in these workshops. Unless they are able to envision 
some type of substantive impact resulting from their participation, either on 
larger S&T issues or in their individual disciplines, they could be reluctant 
to invest the substantial amount of time required for serious participation. 
This, however, did not turn out to be a problem for the Autonomous Flying 
Systems workshop, apparently because of the limited size of the field and 
the interest of the participants in the type of workshop conducted. 

In addition, during the workshop, participants did not appear to have reluc- 
tance in sharing new concepts. This is in stark contrast to some workshops 
the author has attended where novel ideas were held very closely. In the Autonomous
Flying Systems workshop, there was a spirit of camaraderie and
cooperation that pervaded the proceedings, and helped overcome the barriers 
to sharing. This spirit was fostered in the pre-meeting e-mail dialogue phase, 
and further nurtured during the meeting by having all attendees participate 
in the proceedings as equal partners. 

Finally, interdisciplinary workshops are a powerful potential source of radi- 
cally innovative ideas if conducted properly. There are three central require- 
ments for success: 

(1) A problem of significant interest to the sponsoring organization must be 
selected; 

(2) An optimal mix of world-class experts appropriate to the problem must 
be chosen; 

(3) Conditions must be created which will motivate the participants to share 
their novel concepts. 

The Autonomous Flying Systems workshop addressed these three require- 
ments to a significant degree. A preliminary concept proposal emerged, and 
a copy of this proposal is available from the author. 

2 Need for Literature/Workshop Synergy

Most organizations use some variant of a workshop/group dynamics approach 
for brain-storming or other proxies for stimulating innovation. The most current 
information is available, and real-time information exchange is unmatched. The 
attendees and participants in these groups tend to be focused subject experts 
representing a small fraction of the relevant technical community; there is rarely 
any complementary sophisticated literature analysis performed, and there are 
rarely experts present from strongly divergent disciplines. The outputs and dis- 
cussion are highly subjective. The workshop techniques tend not to make full 
use of many of the information technology advances of recent years. Probably 
most importantly, there are strong disincentives for the participants to reveal 
the latest innovations. What many workshops produce in practice are forums for 
“selling” completed or near-completed efforts. 

A few performers, individuals or small groups of individuals, pursue the 
literature-based computer-assisted approach. This literature approach tends to 
be more sophisticated and technologically advanced than the workshop ap- 
proach, and is more objective. It is more comprehensive, since it encompasses 
S&T beyond the scope of any individual, or group of individuals, and can access
data from many technical disciplines and many global sources. The base
data is not as current as that of the workshop approach, due to the documentation time
lag. However, with the advent of extensive on-line documentation, this time lag 
has been reduced considerably. One intrinsic limitation is that a relatively mod- 
est amount of S&T performed globally is documented and readily accessible to 
the wider user community [11]; obviously, any S&T not documented cannot be 
accessed. The literature-based approach has not received widespread attention 
and may fall short of the interpretive and analytical strengths of the workshop 
approach. As a result, the literature approach is not widely used (e.g., [1]). 

While either the workshop approach or the literature approach can be done 
independently to help stimulate discovery, they should be done in tandem to 
maximize the benefit provided by each. There is nothing on record to indicate 
that this joint approach to innovation has been implemented, or even considered. 
The Autonomous Flying Systems workshop described in this paper has some el- 
ements of the combined approach. Some of the DT proximity analysis tools were 
used to identify the scope of related literatures, and the prolific individuals in 
these literatures. These individuals were then invited to the workshop. However, 
time constraints precluded using the full capabilities that the literature-based 
approach can offer. 

In a joint workshop-literature effort, the literature approach would be in- 
cluded in the background pre-meeting phase of the workshop approach (as de- 
veloped in Appendix B). Accordingly, the literature study would provide: 

(1) background reading for the workshop participants in related yet disparate
science and technology areas;

(2) strategic maps of the broader science and technology literature as outlined
in the DT papers referenced above;

(3) promising opportunities for innovation and discovery; and

(4) the disparate science and technology disciplines from which the experts for
the workshop could be drawn.

The hybrid literature-workshop approach would eliminate the limitations of 
each approach done separately. The right people from the right combination of 
disciplines could be identified by the literature-based approach, and invited to 
the workshop. The literature-based analysis could structure the technical rela- 
tionships, and provide an objective starting point for discussion. Network-centric 
peer review would allow linking, and fusing information from, large numbers of 
reviewers to incorporate more representative opinion sampling from the larger 
technical community. The only limitation not overcome is the disincentives for 
the participants, or document authors, to reveal their latest S&T efforts and 
innovations. 

There is extra time and cost involved with two approaches, and if responses 
were required with severe time limitations, then only one approach might prove 
feasible. For organizations that are serious about innovation and discovery, the 
additional time should not be a factor, given the potential high marginal bene- 
fits. Government could probably draw upon a more eclectic group than industry. 
Because of the competitive aspects, industry would probably rely more upon in- 
ternal participants and contracted consultants, whereas government would draw 
upon individuals from many organizations. 



3 Conclusions 

The advent of large databases, and the parallel advances in computer hardware 
and software, provide the opportunity to augment and amplify traditional ap- 
proaches of human creativity in generating discovery and innovation. This paper 
has shown that multi-discipline structured workshops can enhance the S&T in- 
novation process, and has shown that multi-discipline literature-based analyses 
can enhance the S&T discovery process. The document has shown conceptu- 
ally that the combination of computer-enhanced literature-based analyses and 
multi-discipline structured workshops has the synergistic potential to dramati- 
cally improve the discovery and innovation process relative to the already strong 
capabilities available from each process separately. This literature-workshop syn- 
ergy represents a potential major breakthrough for systematically identifying: 1) 
the most promising disciplines to be used in the workshop; 2) specific experts 
from these different disciplines; 3) candidate promising concepts that form the 
basis for discussion. 

A Literature Approach

A.1 Overview

The theoretical basis of the literature approach mirrors the scientific process 
in many ways. Information from diverse literatures, with relevant interfaces, is 
examined. All information is first analyzed and then synthesized to produce 
discovery and innovation. Initial work [18,2] examined three variable classes or 
themes (c, b, a) in two literature categories (C and B) using two different ap- 
proaches (start with c, determine b, then determine a; start with c and a, then 
determine b). 

(NOTE: The sequence abc will typically (but not always) represent a time- 
varying process, such as a procession from research to development to systems. 
Where this sequence does represent a temporal process, the convention used in the 
remainder of this appendix is that the alphabetical designation of variables follows 
the arrow of time. Thus, a might represent a research variable or theme, b might 
represent a technology variable or theme, and c might represent a system variable 
or theme. The terms “variable” and “theme” are treated as interchangeable; 
“thematic variable” is used in places to emphasize this congruence.) 

The principal thematic variables determine a thematic literature. From the 
previous example, if Raynaud’s disease is the thematic variable specified initially, 
then the corresponding thematic literature might be all the papers in a given 
database that contain the phrase Raynaud’s disease. The remaining thematic 
variables and literatures are determined by applying different algorithms to the 
initial thematic literature and subsequent derived literatures. Again, from the 
previous example, an algorithm would be applied to the Raynaud’s disease the- 
matic literature to determine the thematic variable blood viscosity, and a derived 
literature could then be determined as all the papers in a given database that 
contain the phrase ’blood viscosity’. 

The first approach in the initial reported work [18,2] could be viewed as
addressing the question: What variables a could influence variable c through
mechanisms b, or, in the example described above, “What treatment factors a
could influence Raynaud’s disease c through the different mechanisms b?” This
approach started with thematic variable c (e.g., Raynaud’s disease), and used
this variable to develop thematic literature C. Algorithms were applied to this
thematic literature database to identify thematic variable b values (b1, b2, etc.,
representing characteristics such as blood viscosity, blood flow, blood platelets,
poor circulation, and others) closely linked to thematic variable c. Each value or
theme of variable b (b1, b2, etc.) was used to develop a thematic literature B1, B2,
etc. Algorithms were applied to each of the thematic B literatures to identify
thematic variable a values (a1, a2, etc., representing characteristics such as fish
oil, eicosapentaenoic acid, and others) closely linked to the specific thematic
variable b of each thematic B literature. Values of the thematic a variables in
each of the thematic B literatures not found in thematic literature C defined a
subset of the thematic B literatures that was disjoint from thematic literature
C (e.g., the term “fish oil” was not found in the Raynaud’s disease literature).
These disjoint thematic a variables and their associated thematic B literature
subsets became candidates for discovery and innovation.
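
The sequence of this first approach (c to literature C, to b values, to B literatures, to a values, then the disjointness test) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-paper corpus is invented toy data, the candidate vocabularies are supplied by hand, and plain substring counting stands in for the linkage algorithms, which are not specified here. The names literature, linked_terms and candidates_from_c are hypothetical.

```python
from collections import Counter

def literature(corpus, phrase):
    """Ids of all papers whose text contains the phrase."""
    return {pid for pid, text in corpus.items() if phrase in text.lower()}

def linked_terms(corpus, paper_ids, vocabulary, exclude):
    """Count vocabulary phrases occurring in the given papers.

    Assumption: simple phrase matching stands in for the linkage algorithms.
    """
    counts = Counter()
    for pid in paper_ids:
        text = corpus[pid].lower()
        for term in vocabulary:
            if term not in exclude and term in text:
                counts[term] += 1
    return counts

def candidates_from_c(corpus, c, b_vocabulary, a_vocabulary):
    """Map each candidate a to the b values that link it to c."""
    lit_c = literature(corpus, c)
    result = {}
    for b in linked_terms(corpus, lit_c, b_vocabulary, exclude={c}):
        lit_b = literature(corpus, b)
        for a in linked_terms(corpus, lit_b, a_vocabulary, exclude={c, b}):
            # Disjointness: a must not occur anywhere in literature C.
            if not (literature(corpus, a) & lit_c):
                result.setdefault(a, set()).add(b)
    return result

# Invented toy corpus; the sentences are placeholders, not findings.
corpus = {
    1: "Raynaud's disease patients show raised blood viscosity.",
    2: "Dietary fish oil lowers blood viscosity in healthy adults.",
    3: "Raynaud's disease is associated with poor circulation.",
    4: "Aspirin and blood viscosity in Raynaud's disease.",
}
found = candidates_from_c(
    corpus,
    "raynaud's disease",
    b_vocabulary=["blood viscosity", "poor circulation"],
    a_vocabulary=["fish oil", "aspirin"],
)
print(found)
```

In this toy run "aspirin" is rejected because it already appears in literature C, while "fish oil" survives the disjointness test and is reported together with its linking b value.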

The other approach reported could be viewed as addressing the question: 
What are the mechanisms b through which variable a could impact variable c?
This approach started with variables c and a, and their associated literatures C 
and A, and identified variables b that were linked to both variables c and a. The 
same types of algorithms as in the first approach were used to identify closely 
linked variables, and the requirement for disjointness between literatures C and 
A was used as a basis for discovery. 
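
A minimal Python sketch of this second approach follows, again under stated assumptions: an invented toy corpus, a hand-supplied vocabulary of candidate b terms, and substring matching in place of the actual linkage algorithms. The function names are hypothetical.

```python
def literature(corpus, phrase):
    """Ids of all papers whose text contains the phrase (case-insensitive)."""
    return {pid for pid, text in corpus.items() if phrase.lower() in text.lower()}

def linking_mechanisms(corpus, a, c, b_vocabulary):
    """Return the b terms found in both literature A and literature C.

    An empty list is returned when A and C overlap, because the a-c link
    is then already documented and is not a discovery candidate.
    """
    lit_a, lit_c = literature(corpus, a), literature(corpus, c)
    if lit_a & lit_c:
        return []
    return sorted(
        b for b in b_vocabulary
        if literature(corpus, b) & lit_a and literature(corpus, b) & lit_c
    )

# Invented toy corpus; the sentences are placeholders, not findings.
corpus = {
    1: "Raynaud's disease patients show raised blood viscosity.",
    2: "Dietary fish oil lowers blood viscosity.",
    3: "Raynaud's disease is associated with poor circulation.",
}
mechanisms = linking_mechanisms(
    corpus, "fish oil", "Raynaud's disease", ["blood viscosity", "poor circulation"]
)
print(mechanisms)

# Adding one paper that mentions both a and c breaks disjointness.
overlapping = dict(corpus)
overlapping[4] = "Fish oil in Raynaud's disease."
no_candidates = linking_mechanisms(
    overlapping, "fish oil", "Raynaud's disease", ["blood viscosity"]
)
```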

From the experience of these two approaches, it becomes clear that the inde- 
pendent and dependent variables chosen, and the algorithmic approach selected, 
depend on the question being asked. Further examination shows that other ap- 
proaches beyond these two are possible to answer other questions. The present 
paper examines seven approaches to generate innovation and discovery that are 
structured to answer seven different questions, and shows how the algorithms 
and techniques developed in Database Tomography are used in these approaches. 
More specific computational details of the latter six approaches can be
found in [10]. 

A.2 Specific Approaches

The following discussion will be limited to scenarios of three variables a, b, c,
and two literatures. In future studies, more complex cases could be candidates 
for analysis and experimentation. 

For the simple two literature/three variable case, seven separate generic cases 
are possible, where the variables specified can be viewed as “independent” and 
the variables determined can be viewed as “dependent:” 

(1) specify a, determine b and c; (2) specify c, determine a and b; 

(3) specify b, determine a and c; (4) specify a and c, determine b; 

(5) specify a and b, determine c; (6) specify b and c, determine a; 

(7) specify a and b and c, validate linkage existence. 

Cases (1), (2), and (3) are the most open-ended and least constrained. In 
each case, one variable is specified, and the other two are determined using 
the DT algorithms, the condition of disjointness and, most importantly, expert 
judgement. Cases (4), (5), and (6) are more constrained, since two variables 
are specified, and the third is determined using similar processes to the above. 
Case (7) is fully constrained, and its purpose is to ascertain literature support 
for validation of a hypothetical relation between specified values of the three 
variables. Cases (4) and (5) are subsets of case (1); cases (4) and (6) are subsets 
of case (2); cases (5) and (6) are subsets of case (3); Case (7) is a subset of cases 
(1) through (6). The solution mechanics for each of these seven cases will now 
be outlined. 
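
The count of seven cases and the subset relations just listed follow from treating each case as the set of variables it specifies: there are 2^3 - 1 = 7 non-empty subsets of {a, b, c}, and one case is a subset of another exactly when it specifies everything the other does plus at least one more variable. The short Python check below uses the case numbering given above; the names CASES, determined and subcases are illustrative only.

```python
VARIABLES = {"a", "b", "c"}

# Case number -> variables specified, following the numbering used above.
CASES = {
    1: {"a"}, 2: {"c"}, 3: {"b"},
    4: {"a", "c"}, 5: {"a", "b"}, 6: {"b", "c"},
    7: {"a", "b", "c"},
}

def determined(case):
    """Variables left to be determined in the given case."""
    return sorted(VARIABLES - CASES[case])

def subcases(case):
    """Cases that specify everything `case` does, plus at least one more variable."""
    return sorted(k for k, spec in CASES.items() if CASES[case] < spec)

for number in sorted(CASES):
    print(number, sorted(CASES[number]), determined(number), subcases(number))
```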



Opportunity Driven. This first case addresses the question, “What are the 
potential variable c impacts that could result from variable a, and what are the 
variable b mechanisms through which these impacts occur?” One specific variant
of this question is of particular interest and importance to the science and tech- 
nology community, “What are the potential impacts on research, development, 
systems, and operations that could result from research on a given topic?” 

If the generic question of this first case is applied to the above example for 
the case where variable a is “fish oil” only, it could be phrased as, “What are 
the potential impacts or benefits (positive or negative) resulting from fish oil 
that would not be obvious from examining the fish oil literature alone?” This is 
an open-ended question, and places no restrictions on the mechanisms b or the 
types of impact c. The first case is represented schematically as: a → b → c.

Here, a is the independent variable, and b and c are the dependent variables 
that result from the solution process. The operational sequence is to start with 
the variable a and generate a literature A. Again following the above example 
and using the abbreviations FO (fish oil), BV (blood viscosity), and RD
(Raynaud’s disease), this means that the process would start by identifying the FO
literature (call this A1). Many approaches could be used to define this literature;
the approach recommended here is the one used in recent DT studies [14,13] for
defining literatures. As an example of one literature definition approach, the
iterative Simulated Nucleation method [6] would be used to identify all the
papers in the Science Citation Index (SCI) which contained FO (and other related
terms in the query) in the title, keywords, and abstract fields. This collection of
papers would constitute the FO literature A1.

(NOTE: Use of the SCI is one example only. Because DT uses full text
databases, there is no limitation in any database selected to titles or key words
or index words, and many different types of databases or free text can be used
for the analysis.)

The next step in the process is to identify the variables b (b1, b2, ...)
linked closely to variable a1, and then identify the literatures B associated with
variable b (B1, B2, ..., the BV literatures). For this step, the proximity analysis
method used in the recent DT studies (or other co-occurrence techniques) would 
be employed. For a journal based database, this method conceptually identifies 
phrases in paper titles or abstracts or main texts physically located near the 
term of interest. As an example, if the term of interest in a given database is 
Raynaud’s disease, then the proximity analysis method would provide a list of 
all phrases in close physical proximity to the term Raynaud’s disease for all 
occurrences of this term in the text. The proximity analysis approach of DT 
is based on the experimental findings that phrases within a semantic boundary 
(same sentence, paragraph, etc.) located physically close to the term of interest 
are contextually and conceptually close to the term of interest. Continuing the 
above example, this step uses the proximity analysis of DT to identify phrases 
in the FO literature physically close to the term FO, such as b1, b2, etc.
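
The idea of proximity analysis can be illustrated with a small Python sketch. It is not the DT algorithm itself, whose details are given in the cited studies; it assumes the sentence as the semantic boundary, a fixed window of neighbouring words, single words in place of multi-word phrases, and a tiny hand-made trivial-word list. The function name proximity_counts and the sample text are invented for the illustration.

```python
import re
from collections import Counter

# Assumed trivial-word filter; a real study would use a far larger list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "with", "on", "by"}

def proximity_counts(text, term, window=4):
    """Count non-trivial words within `window` words of `term`.

    Neighbours are never taken from across a sentence boundary.
    """
    term_words = term.lower().split()
    n = len(term_words)
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i in range(len(words) - n + 1):
            if words[i:i + n] != term_words:
                continue
            left = words[max(0, i - window):i]
            right = words[i + n:i + n + window]
            counts.update(w for w in left + right if w not in STOP_WORDS)
    return counts

sample = (
    "Fish oil reduces blood viscosity. "
    "Blood viscosity is raised in cold weather. "
    "Fish oil also reduces platelet aggregation."
)
counts = proximity_counts(sample, "fish oil", window=3)
print(counts.most_common())
```

Words from the second sentence never enter the counts because the term of interest does not occur there, and "aggregation" is excluded because it lies outside the three-word window.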

For each of these identified phrases b1, b2, etc., a literature (B1, B2, ...) is
established by querying the SCI. The next step is, for each of these B literatures,
to identify the linked variables c (c1, c2, ...). The process used to identify the
variables b1, b2, etc. linked to variable a1 is repeated to obtain the variables
c1, c2, etc. linked to each value of variable b. The subsets of the B literatures
which are disjoint from literature A1 (e.g., the B literatures which don’t contain
the term FO) must then be identified, and the variables c (and their associated
linking mechanisms b to variable a1) within these disjoint B literature subsets
then become the candidates for discovery and innovation.

It is obvious that the process can easily mushroom out of control unless
stringent limiting constraints are placed on the number of B literatures and c
variables selected. For example, suppose that three b variables b1, b2, b3 (and
their associated three B literatures B1, B2, B3) are identified as closely linked
to FO. Suppose also that each of these three b variables is closely linked to five c
variables. Then four literature searches are required (A1, B1, B2, B3), and fifteen
abc linked pathways must be examined for disjointness and discovery, according
to the following:



a1 → b1 → c11; a1 → b1 → c12; a1 → b1 → c13; a1 → b1 → c14; a1 → b1 → c15

a1 → b2 → c21; a1 → b2 → c22; a1 → b2 → c23; a1 → b2 → c24; a1 → b2 → c25

a1 → b3 → c31; a1 → b3 → c32; a1 → b3 → c33; a1 → b3 → c34; a1 → b3 → c35
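
The arithmetic of this example generalizes directly: with m variables b and n variables c per b, the process needs 1 + m literature searches and yields m × n pathways. A short Python check, with hypothetical names, reproduces the figures for m = 3 and n = 5:

```python
def pathways(a, links):
    """List every (a, b, c) pathway; `links` maps each b to its linked c variables."""
    return [(a, b, c) for b, cs in links.items() for c in cs]

# Three b variables, each linked to five c variables, labelled as in the example.
links = {f"b{i}": [f"c{i}{j}" for j in range(1, 6)] for i in range(1, 4)}
paths = pathways("a1", links)

searches = 1 + len(links)   # literature A1 plus one B literature per b variable
print(searches, len(paths))
```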



In reality, there will be hundreds, if not thousands, of candidate b and c vari- 
ables. However, there are different ways by which the b and c variables can be 
sharply limited in number. First, the analysts performing the study would elimi- 
nate all non-technical content phrases that passed through the trivial word filter 
in the DT algorithm. Second, the numerical indices for each phrase generated 
by the DT proximity algorithm would be used as one figure of merit for pre- 
selection of key phrases. Third, those c variables that reappear in different abc 
pathways would have a higher priority for selection. Fourth, analyst judgement 
would be applied to weight the potential value of the different abc pathways in 
computing figures of merit. 
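
The third criterion, priority for c variables that reappear in different pathways, is easy to express in code. The Python sketch below is one possible reading, not the DT procedure: it ranks each c by the number of distinct b mechanisms that lead to it, with ties broken alphabetically. The pathway list is abstract placeholder data and rank_by_recurrence is a hypothetical name; the numerical proximity indices and analyst weighting mentioned above are deliberately left out.

```python
from collections import defaultdict

def rank_by_recurrence(paths):
    """Order c variables by how many distinct b mechanisms reach them."""
    mechanisms = defaultdict(set)
    for _a, b, c in paths:
        mechanisms[c].add(b)
    return sorted(mechanisms, key=lambda c: (-len(mechanisms[c]), c))

# Abstract placeholder pathways: c1 is reached through three mechanisms.
paths = [
    ("a1", "b1", "c1"),
    ("a1", "b2", "c1"),
    ("a1", "b2", "c2"),
    ("a1", "b3", "c1"),
    ("a1", "b3", "c3"),
]
ranking = rank_by_recurrence(paths)
print(ranking)
```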

The literature searches and proximity analyses are fairly straightforward, 
and have been refined in the DT process. The main intellectual efforts must be 
focused on prioritizing and reducing the number of linked variables or litera- 
tures to be examined, and interpreting the relationships among the final disjoint 
literatures to generate potential discovery relationships. 



Requirements Driven. This second case addresses the question, “What are 
the variables a that could impact variable c, and what are the variable b mecha- 
nisms by which these impacts are produced?” Applied to the above example for 
the case where c is Raynaud’s disease only, it could be phrased as “What are the 
factors and their associated mechanisms that could impact the course of Ray- 
naud’s disease that would not be obvious from examining the Raynaud’s disease 
literature alone?” This second case is represented schematically as: a → b → c.
Here, c is the independent variable, and b and a become the dependent variables. 



Mechanism Driven. The third case addresses the question, “For a given
mechanism b, what are the variables a that could impact the variables c?” Applied
to the above example for the case where b is blood viscosity, it could be phrased
as, “What combinations of variables that could effect a change in the blood
viscosity mechanism and could be impacted by a change in the blood viscosity
mechanism are candidates for discovery that were not obvious from examining
only the blood viscosity literature?” The third case is represented schematically
as: a → b → c. Here, b is the independent variable, and a and c are dependent
variables.



Opportunity-Requirements Driven. This fourth case addresses the ques- 
tion, “What are the mechanisms b through which variable a could impact vari- 
able c?” Applied to the above example for the case where c is Raynaud’s disease 
only, and a is fish oil only, it could be phrased as, “What are the mechanisms 
through which fish oil could impact Raynaud’s disease that would not be obvious 
from examining only the Raynaud’s disease literature or the fish oil literature?” 
The fourth case is represented schematically as: a → b → c. Here, variables a
and c are independent, and variable b is the dependent variable. 

Opportunity-Mechanism Driven. The fifth case addresses the question, 
“What are the variables c which could be impacted by variable a through
mechanism b?” Applied to the above example for the case where b is blood viscosity
only, and a is fish oil only, it could be phrased as, “What abnormalities could be
influenced from the impact of fish oil on blood viscosity that would not be
obvious from examining only the abnormality’s literature or the fish oil literature?”
The fifth case is represented schematically as: a → b → c. Here, a and b are the
independent variables, and c is the dependent variable. 



Requirements-Mechanism Driven. The sixth case addresses the question, 
“What are the variables a that could impact variable c through mechanism b?”
Applied to the above example for the case where b is blood viscosity only, and
c is Raynaud’s disease only, it could be phrased as, “What factors could impact
Raynaud’s disease by impacting blood viscosity that would not be obvious from
examining only the factors’ literature or the Raynaud’s disease literature?” The
sixth approach is represented schematically as: a → b → c. Here, b and c are the
independent variables, and a is the dependent variable.



Opportunity-Mechanism-Requirements Validation. The seventh case
addresses the question, “Does the literature support the possibility that variable a
could impact variable c through mechanism b?” Applied to the above example
for the case where a is fish oil only, b is blood viscosity only, and c is Raynaud’s
disease only, it could be phrased as, “Does the literature support the possibility
that fish oil could impact Raynaud’s Disease by altering blood viscosity in a way
that would not be obvious from examining only the fish oil literature or the
Raynaud’s disease literature?” The seventh approach is represented schematically as:
a → b → c. Here, a and b and c are independent variables.

B Crossing the Bridge: Interdisciplinary Workshops for Innovation

B.1 Background

ONR established a series of workshops in 1997 aimed at promoting innovation 
while also enhancing organization, category, and discipline diversity components. 
The focus of the first novel workshop founded on this plan was “Autonomous 
Flying Systems,” an area of perceived long-term interest to not only the Navy 
and DOD, but also to NASA and other governmental and industrial organiza- 
tions. The process employed was designed starting with a clean slate and was 
intended for application to very significant technical challenges. The present 
appendix further describes the process that was used to identify the technical 
theme of the workshop, select the participants, and conduct all three phases of 
the total workshop. 



B.2 Workshop Theme Identification 

It was decided that the initial workshop theme should 1) focus on problems
related to the main science and technology emphasis area of the author’s home
organization, Strike Technology, and 2) help establish the most supportive
environment for innovation. The problem selected should be focused and
understandable, and it should have a generic technical base amenable to soliciting people
from many different disciplines. The topic finally selected was autonomous con- 
trol of unmanned air vehicles, including takeoff and landing from limited areas 
on smaller Navy ships. It was apparent that the underlying science and tech- 
nology permeated many different disciplines, including aerodynamics, controls, 
structures, communications, guidance, navigation, propulsion, sensing, and sys- 
tems integration. Also, the naval applications for some aspects of this problem 
were sufficiently unique that probably not a great deal of work had been done 
in this area. Subsequent literature analyses validated this assumption. 

Present naval air systems are either manned (most aircraft) or tele-operated, 
semi-autonomous (weapons and some aircraft). The weapons are a mix ranging 
from “dumb” bombs and shells to “smart” missiles. The future trend is toward 
“smart” autonomous or semiautonomous aircraft and weapons. Since a major 
role of ONR is to proactively address the technology that will influence future 
naval forces, it seemed natural to examine S&T roadblocks on the path to un- 
manned autonomous “smart” flight systems. Consequently, the focus of the ini- 
tial workshop was defined as identification of the fundamental operational princi- 
ples of autonomous flying systems over a fairly wide range of flight environments. 
In particular, the workshop was aimed at examining what had been learned 
about autonomous or semiautonomous operation from the animal (mainly flying) 
kingdom and from other unmanned autonomous/semiautonomous tele-operated 
systems such as autonomous underwater vehicles and locomoted robots. Ani- 
mals are now being studied as integrated systems by scientists on the forefront 
of biological research. The issues of aerodynamics, flight mechanics, dynamic
reconfiguration, materials, control, neuro-sciences, and locomotion are not being
studied as separate disciplines by these scientists, but rather are being studied 
in parallel in the same animal system and in their relation to the function and 
mission of the animal system. While this integrative biological research is in its 
infancy, and results are only starting to emerge, the time seemed appropriate 
for assembling these diverse groups and exploiting their synergy. Not only could 
there be benefit to the Navy from such cross-discipline interaction, but benefit 
could be possible for each of the contributing disciplines as well. 

A major thrust of the workshop was projected to be identification of the 
autonomous operational principles for each unique system and the relation of 
these principles to mission and function, then extraction of the generic opera- 
tional principles that underlay all the systems, both biological and man-made. 
It was hoped that the cross fertilization of disciplines would be able to further 
elucidate and clarify the more important generic concepts, and then provide in- 
sight that could be utilized to enhance the autonomous operation of naval flying 
systems. 

B.3 Participant Selection 

Once the theme of the workshop was established, a sub-theme taxonomy was 
developed to focus the agenda and to identify workshop participants. A dual 
approach was followed to generate the taxonomy. 

Discussions were held with agency experts on the generic theme concerning 
the taxonomy structure. In parallel, the SCI was queried for papers related to the 
generic theme. Both bibliometric and computational linguistics analyses of these 
papers were performed to provide strategic maps of the topical area, identifying 
key performers, journals, institutions, and their relations to the technical themes 
and sub-themes of the workshop. A taxonomy was constructed based on these 
strategic maps. (For a description of how the bibliometric and computational 
analyses are combined to generate strategic maps, see [8,10]). 

(NOTE: If a combined literature-based and workshop-based approach is taken 
for stimulating innovation, as recommended in this paper, the literature-based 
analysis would be done at this stage.) 

Both of these taxonomy sources, in-house experts and the SCI, then provided 
initial candidates for participation in the workshop. These candidates were con- 
tacted, and asked to suggest additional candidates. This procedure continued 
until a large pool of potential candidates was established. Three main selection 
criteria for workshop participants were established:

(1) multiple recommendations, 

(2) significant publications in the field, and

(3) literature citations. 

These three criteria were tempered with judgement to ensure that bright
young individuals, who had not yet established a track record, were not excluded
from the pool, and that the panel as a whole had the correct level of discipline,
category, and organization balance. In addition, a guideline was established that
all workshop attendees would be active participants, so the number of attendees
was limited to facilitate discussion and interactions.

All these constraints, guidelines, and selection criteria were used to arrive at 
the final panel size and structure. The result was a panel of slightly more than 
twenty people representing a mix of disciplines that included biologists (experts 
in bird, bat, frog, fish, or insect studies), robotics, artificial intelligence, controls, 
autonomous aircraft, fluid dynamics, sensors, neuroscience, cognitive science, 
autonomous underwater vehicles, aerodynamics, propulsion, and avionics. 

B.4 Overview of Workshop Process Steps 

Workshop Buildup. The buildup period for the Workshop in question started 
about two months before the meeting. Specific guidance for the conduct of 
the workshop was sent to the participants by e-mail, including a statement of 
the naval technical problems to be addressed. The technical component of the 
buildup phase was then conducted by e-mail. 

The main purpose of this buildup phase technical component was to have 
each participant generate new ideas from his/her discipline for all other par- 
ticipants to consider. The other participants could then dialogue by e-mail to 
clarify/modify/embellish these ideas. At a minimum, even if no dialogue resulted, 
there would be a gestation period of about two months for each participant to 
absorb these concepts from other disciplines. Specifically, each participant was 
requested to: 

— submit a half dozen leading edge capabilities or accomplishments in his/her 
discipline(s) that could potentially impact the naval technical problems; and

— identify several leading edge capabilities or accomplishments projected in 
his/her discipline(s) over the next decade that could potentially influence 
the naval technical problems; and 

— submit a few leading edge capabilities or accomplishments in his/her
discipline(s) whose impact on the naval technical problems was not obvious to
him/her, but might be obvious to someone else. 

The participants were free to comment on potential relations among any of 
the capabilities, accomplishments, or combinations of capabilities and accom- 
plishments, and any of the naval technical problems, or combinations of prob- 
lems. One of the functions of the participants from the author’s organization 
was to facilitate and stimulate dialog by raising questions and issues on the 
submitted information. 

In actual practice, most of the comments generated resulted from questions 
stimulated by the discussion facilitator, the author. All of the comments received 
were then sent to all the participants. This exercise helped stimulate the thinking 
of the participants, and provided a documented record of the process. 

If any of the participants saw a capability or accomplishment from another 
participant that could impact a problem in his/her discipline, but not impact a 
naval technical problem, then the two participants were free to dialog together
without informing all the participants. However, these two participants engaged 
in independent dialog were requested to keep a record of their exchange that 
might be included with the final workshop report as a potential innovation. This 
would cover the real possibility of innovation occurring in topics other than the 
one targeted. 

Workshop Meeting. As a result of the ideas presented during the buildup 
phase, it appeared that the seeds existed for a new S&T program on Autonomous 
Flying Systems. Therefore, an agenda was sent to the participants with further 
guidance to address promising S&T opportunities at the workshop that would
serve as the foundation of such a program. Specifically, the participants were 
asked to address the following issues at the workshop: 

— What are the present leading-edge capabilities in your discipline? 

— What are the desired future capabilities in your discipline? 

— What are the leading research opportunities in your discipline and what 
additional capabilities could they provide if successful? 

— What is the level of risk of these opportunities successfully achieving their 
targets? 

— How would these potentially enhanced capabilities contribute to, or trans- 
late into, improved understanding and/or operation of autonomous flying 
systems? 

The meeting occurred on 10-11 December 1997 at ONR. Since some of the 
leading edge capabilities and potential accomplishments appeared to have appli- 
cability to naval technical problems (identified during the e-mail buildup period), 
the proponent for the capability or accomplishment item took the lead in flesh- 
ing out his/her ideas and leading the discussion at the meeting. As a result, the 
workshop meeting tended to evolve into full panel discussions on each of these 
potential capabilities. 

There were two rounds of discussion at the workshop. The first round con- 
sisted of presentations and discussions by each proponent. The second round of 
the workshop consisted of each participant identifying his/her leading promising 
research opportunities. 



Workshop Cleanup. The participants were requested to provide any addi- 
tional narrative information that added to or modified their ideas as a result 
of the workshop experience. The outcomes of the workshop included both the 
tangible and intangible. 

Three immediate tangible outcomes were projected: 

(1) A concept proposal for an S&T program focused on Autonomous Flying 
Systems would be generated; 

(2) Technical papers might be submitted to leading science journals based on
innovations identified; and

(3) One or more papers on the complete workshop experience might be
submitted to leading science journals.

In addition to developing specific topics, it was anticipated that new, un- 
exploited ideas in interdisciplinary research and development might surface dur- 
ing contact between panelists. These novel subjects might form the basis of 
additional workshops. In addition, extensive lessons were learned as a result of 
the workshop process. These lessons were summarized in Section 1.2. 

Abstract. It is very difficult, if not impossible, for researchers to
manually evaluate and revise their scientific models using a vast amount of
relevant information now available to them. The paper describes a new 
framework, called JustAid, that successfully integrates techniques from 
Knowledge Acquisition and Machine Learning in a way that complements 
their strengths to overcome their weaknesses, and provides an interac- 
tive environment to help researchers in a process of scientific discovery. 
JustAid can use information stored in medical databases and assist ex- 
perimental scientists in forming, testing and revising scientific models, 
without a need of a knowledge engineer. In this paper, JustAid has been 
applied to a real world problem in the area of neuroendocrinology, a 
branch of physiology. 



1 Introduction 

PubMed, a web-based service of the National Library of Medicine (USA)^1,
provides access to over 11 million citations from MEDLINE and additional life
science journals. PubMed includes links to many sites providing full text arti- 
cles and other related resources. The Unified Medical Language System (UMLS) 
project, also by the National Library of Medicine (USA), develops and distributes 
multi-purpose, electronic “Knowledge Sources” that can be used by a wide vari- 
ety of application programs to overcome retrieval problems caused by differences 
in terminology and the scattering of relevant information across many databases. 
Thus, systems are being developed to allow researchers to access a vast amount 
of relevant information more quickly and easily. However, these systems do not 
yet provide automated tools that can be used to construct, test and invent new 
scientific models using the available information. This may lead to a possible 
oversight of important data and insights. This paper describes a new framework 
and a tool, called JustAid, for using information stored in medical databases 
to assist experimental scientists in forming and revising models. 



^1 http://www.nlm.nih.gov/

K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 214-227, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

2 Motivation: Hypothesis Testing in the Area of Neuroendocrinology

In this section we introduce a small real world problem from the area of neu- 
roendocrinology. The problem illustrates a type of hypothesis and reasoning used 
in the area of neuroendocrinology. Smythe carried out experiments to investigate
the role of central feedback of glucocorticoids in the control of
adrenocorticotropin (ACTH) release by examining the effect of acute dexamethasone
pre-treatment on hypothalamic noradrenergic neuronal activity (NNA) and on 
the secretion of ACTH and corticosterone (CORTICO) in response to stress [5] 
(see Fig. 1). The author of the paper was interested in explaining the effects of 
dexamethasone on normal rats and the effects of stress on rats with dexametha- 
sone pre-treatment, using his proposed hypothesis shown in Fig. 1. He carried 
out a set of experiments to observe the effects of acute dexamethasone pre- 
treatment on hypothalamic noradrenergic neuronal activity (NNA), adrenocor- 
ticotropin (ACTH) secreted from the pituitary gland and serum corticosterone 
(CORTICO) secreted from the adrenals. Importantly he wanted to observe these 
effects under normal conditions and also in response to stress. He wanted to use 
stress because it was assumed to activate the whole system. Overall 20 rats were 
used in the experiment. To induce stress, rats were forced to swim in cold water. 




Fig. 1. The hypothesis from [5] 



2.1 Experimental Results 

The measured parameters in the experiments are NNA, ACTH and CORTICO.
The hypothalamic peptide corticotrophin-releasing factor (CRF) is not
measured in this experiment. Table 1 shows the values for these parameters after
injecting dexamethasone, and also after cold swim stress in rats.

Table 1. Experimental results from [5]

                          Experiment-1   Experiment-2      Experiment-3       Experiment-4
                          controls       Dexamethasone     Cold Swim Stress   Dexamethasone and
                                                                              Cold Swim Stress
NNA (ratio - no units)    0.122          0.105             0.210              0.246
CORTICO (nmol/L)          129            11.3              1232               32.8
ACTH (pg/ml)              89             Undetectable      240                Undetectable
                                         (close to zero)                      (close to zero)

In Table 1, the second column for “controls” (call it Experiment-1) indicates
the values for normal rats (without any treatment). The third column for
“Dexamethasone” (call it Experiment-2) indicates the values after injecting dex- 
amethasone. The fourth column for “Cold Swim Stress” (call it Experiment-3) 
indicates the values after inducing cold swim stress. The fifth column for “Dex- 
amethasone and Cold Swim Stress” (call it Experiment-4) indicates the values 
after inducing cold swim stress on rats with dexamethasone pre-treatment. 

2.2 Hypothesis Is a Causal Model 

Fig. 1 shows the hypothesis presented in the research paper [5]. In the hypoth- 
esis, links represent causal relations between the nodes (parameters). These 
relations represent qualitative influences: stimulatory (direct or +) or inhibitory
(inverse or -). For example, in Fig. 1, the inhibitory link from CORTICO to 
NNA represents the following: 

increase in NNA can be explained by decrease in CORTICO and 
decrease in NNA can be explained by increase in CORTICO 

Similarly, the stimulatory link from ACTH to CORTICO represents the 
following: 

increase in CORTICO can be explained by increase in ACTH and 
decrease in CORTICO can be explained by decrease in ACTH 

No further assumptions regarding these links (influences) are made. This 
is particularly important, as researchers do not have sufficient information re- 
garding such influences, except that they are either stimulatory or inhibitory. For 
example, they may know that injection of dexamethasone will have an inhibitory 
effect on hypothalamic noradrenergic neuronal activity (NNA). However, they 
do not know the precise amount of decrease in NNA for a specific increase in 
dexamethasone. As discussed in the following section, such hypotheses are used 
in deriving explanations for experimental results. Fig. 2 shows another (large) 
hypothesis, from the area of neuroendocrinology, published in [6] and which 
subsumes the model in Fig. 1.

Fig. 2. The hypothesis from [6]



2.3 Completeness of the Model (Explanations for Differences) 

As mentioned earlier, the author was interested in examining the effects of 
dexamethasone (Dex) on normal rats and also the effects of stress on rats with 
dexamethasone pre-treatment, using his proposed hypothesis (see Fig. 1). The 
experimental results in Table 1 show that after injecting dexamethasone, the 
value of NNA decreased from 0.122 (in the second column) to 0.105 (in the third 
column), the value of CORTICO decreased from 129 (in the second column) to
11.3 (in the third column) and the value of ACTH decreased from 89 (in the
second column) to close to zero (in the third column). We can summarise this
difference as follows:

Cause: Dex is increased 

Effects: NNA is decreased, CORTICO is decreased, ACTH is decreased 

The above effects represent a tonic action of dexamethasone on the brain
to influence hypothalamic NNA and CRF release. The reduced hypothalamic
NNA could contribute to the lower levels of ACTH and CORTICO, through
reduced CRF. The only constraint imposed while deriving these explanations
is that a node cannot be assumed to have two different states in explaining
one difference. For example, in the above difference, we need to explain three
effects. We cannot assume that CRF increases in explaining one effect and that
CRF decreases in explaining another effect.
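
The difference summarised above can be derived mechanically from Table 1. The
following Python sketch is illustrative only: the dictionary layout, the 0/1
encoding of the two treatments and the function names are assumptions made
here, not part of [5] or of JustAid. An undetectable ACTH reading is stored as
None and treated as close to zero; when both readings are undetectable the
direction of change is unknown and the parameter is left out.

```python
# Values from Table 1. Treatments are encoded as 0 (absent) or 1 (applied);
# this encoding is an assumption of the sketch. None means "undetectable".
TABLE_1 = {
    "Experiment-1": {"Dex": 0, "ColdSwimStress": 0,
                     "NNA": 0.122, "CORTICO": 129.0, "ACTH": 89.0},
    "Experiment-2": {"Dex": 1, "ColdSwimStress": 0,
                     "NNA": 0.105, "CORTICO": 11.3, "ACTH": None},
    "Experiment-3": {"Dex": 0, "ColdSwimStress": 1,
                     "NNA": 0.210, "CORTICO": 1232.0, "ACTH": 240.0},
    "Experiment-4": {"Dex": 1, "ColdSwimStress": 1,
                     "NNA": 0.246, "CORTICO": 32.8, "ACTH": None},
}
TREATMENTS = ("Dex", "ColdSwimStress")


def direction(before, after):
    """Return 'increased', 'decreased', or None if unknown or unchanged."""
    if before is None and after is None:
        return None  # both below the sensitivity level: direction unknown
    before = 0.0 if before is None else before
    after = 0.0 if after is None else after
    if after > before:
        return "increased"
    if after < before:
        return "decreased"
    return None


def difference(first, second, table=TABLE_1):
    """Summarise the qualitative difference between two experiments."""
    causes, effects = {}, {}
    for name in table[first]:
        change = direction(table[first][name], table[second][name])
        if change is None:
            continue
        target = causes if name in TREATMENTS else effects
        target[name] = change
    return causes, effects


causes, effects = difference("Experiment-1", "Experiment-2")
print("Cause:", causes)
print("Effects:", effects)
```

Calling difference("Experiment-2", "Experiment-4") omits ACTH, which matches
the remark below that the change in ACTH need not be explained when both
readings are undetectable.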

Similarly, after inducing cold swim stress on rats with dexamethasone pre- 
treatment, the value of NNA increased from 0.105 (in the third column) to 0.246 
(in the fifth column) and the value of CORTICO increased from 11.3 to 32.8. The 
values of ACTH were undetectable in both cases. This means that ACTH could
have increased or decreased while remaining below the sensitivity level. As we
are not sure, we do not need to explain the change in ACTH. These results
represent a stimulatory effect of stress on the hypothalamic NNA. Increased
secretion of CORTICO can be attributed to increased NNA, through increased CRF
and ACTH. If we assume that ACTH increased below sensitivity levels, the
hypothesis can explain the
both the crucial differences that were of interest to the author. However, the 
hypothesis is not able to explain the following two differences: 

1. The difference between Experiment-1 and Experiment-4:

Causes: Dex is increased, ColdSwimStress is increased 

Effects: NNA is increased, CORTICO is decreased, ACTH is decreased 

2. The difference between Experiment-3 and Experiment-4: 

Causes: Dex is increased 

Effects: NNA is increased, CORTICO is decreased, ACTH is decreased 



That is, we cannot trace back the corresponding effects to one of the causes using 
the proposed hypothesis (see Fig. 1). For example, for the first difference above, 
we can explain increase in NNA because of Cold Swim stress, but we cannot 
explain the other two effects: CORTICO decreasing and ACTH decreasing, as 
the model indicates that increase in NNA should increase ACTH. 
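
The completeness check described in this section can be sketched in a few lines
of Python. This is a minimal illustration, not the JUSTIN or JustAid
implementation. Fig. 1 is not reproduced here, so the link set below is an
assumption reconstructed from the prose (Dex inhibits NNA, stress stimulates
NNA, NNA acts on CORTICO through CRF and ACTH, and CORTICO inhibits NNA); the
function names are likewise hypothetical. An effect counts as explained if a
simple causal path leads from some cause to it with the observed sign, and no
node takes two different states within one difference.

```python
from itertools import product

UP, DOWN = +1, -1

# ASSUMED reconstruction of Fig. 1: +1 = stimulatory, -1 = inhibitory.
LINKS = {
    ("Dex", "NNA"): -1,
    ("ColdSwimStress", "NNA"): +1,
    ("NNA", "CRF"): +1,
    ("CRF", "ACTH"): +1,
    ("ACTH", "CORTICO"): +1,
    ("CORTICO", "NNA"): -1,
}


def causal_paths(effect, causes, links):
    """Yield {node: state} for each simple path from a cause to `effect`."""
    def walk(node, states):
        if node == effect:
            yield states
            return
        for (src, dst), sign in links.items():
            if src == node and dst not in states:
                yield from walk(dst, {**states, dst: states[node] * sign})

    for cause, state in causes.items():
        yield from walk(cause, {cause: state})


def explains(causes, effects, links=LINKS):
    """True if every effect can be traced back to a cause consistently."""
    observed = {**causes, **effects}
    options = []
    for effect in effects:
        # Keep only paths that agree with every observed node state.
        paths = [p for p in causal_paths(effect, causes, links)
                 if all(observed.get(n, s) == s for n, s in p.items())]
        if not paths:
            return False
        options.append(paths)
    # One path per effect, such that no node is given two different states.
    for combo in product(*options):
        merged = {}
        if all(merged.setdefault(n, s) == s
               for path in combo for n, s in path.items()):
            return True
    return False


# Experiment-1 vs Experiment-2, and Experiment-1 vs Experiment-4.
print(explains({"Dex": UP}, {"NNA": DOWN, "CORTICO": DOWN, "ACTH": DOWN}))
print(explains({"Dex": UP, "ColdSwimStress": UP},
               {"NNA": UP, "CORTICO": DOWN, "ACTH": DOWN}))
```

Under the assumed link set, the two differences the author considered are
explained, while the two differences listed above are not, in agreement with
the discussion in the text.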

The author overlooked the above two differences because they were not rel- 
evant to his claim in that paper. The author considered two crucial differences 
while constructing his hypothesis and did not include the other differences in his 
context. As a result, the hypothesis presented in the paper was not able
to explain some of the differences. It should be noted that the paper [5] was a 
refereed paper published in a prestigious international journal. Neuroendocrinol- 
ogy is a particularly challenging domain and the complex hypotheses discussed 
in the neuroendocrine papers are far from certain. They may be well accepted 
but are not necessarily well proved (tested against many experimental results). 
Researchers are always trying to present experimental data to support their own 
hypotheses or deny alternative hypotheses. 



2.4 Incompleteness Checking 

[1] describes a system, called JUSTIN (JUSTification IN context), that can be 
used to build and test models in neuroendocrinology. The aim of their research 
was to use qualitative reasoning to build and test hypotheses in the area of
neuroendocrinology. In [1], JUSTIN was used to test the model (hypothesis) of
brain influence on glucose homeostasis published in [6]. The paper [6] summarises
six other neuroendocrine papers covering a range of hypotheses regarding brain 
control of glucose. The part of the summarised model (hypothesis) published 
in [6] is shown in Fig. 2. JUSTIN generated all possible differences and tested 
the model against them. It reported that 32% of the differences (between ex- 
perimental treatments) could not be explained by the model. Thus JUSTIN can 
test the incompleteness of the current model but it cannot revise the current 
model such that it can explain all the observations. JustAid, described in this
paper, is directed towards automating the model-revision process in the area of 
neuroendocrinology. 



3 JustAid 

JustAid allows an expert to easily build partial models of the domain, without 
any need for a knowledge engineer. JustAid can carry out incompleteness checking
using the available observations, and if the current model cannot explain all 
the observations (situations), techniques from Machine Learning are used to 
invent possible new models that can explain all the observations. The learning 
algorithm can effectively use partial models of the domain while inventing new 
possible models. The learning algorithm also allows an expert to provide domain 
dependent criteria to guide the search process while looking for a new suitable 
model. Thus, JustAid uses domain knowledge in two different ways: initially to 
generate partial models and later to guide the search process while looking for 
a new suitable model. 

In JustAid, an expert only deals with directed causal models throughout the 
process of model creation and modification. The graphical models are automatically
converted into Horn-clause logic and a model-revision process is accomplished
by a logic-based learner. As shown in Fig. 3, the aim is to provide an
intuitive and effective user interface that allows an expert to focus on the issues 
related to the domain and not worry about modeling constructs or underlying 
logical representation. 



4 Model Completion Using JustAid 

Space restrictions prevent a presentation of the theoretical basis for model com- 
pletion in JustAid. We refer the reader to [2] and [3]. Here we simply note that 
this model completion is derived within a logical setting that forms the basis 
of Inductive Logic Programming. JustAid incorporates a new incremental learn- 
ing technique that can learn definite clause logic programs from observations 
that are not in the form of definite clauses. In this section we provide an
informal description of the learning technique used in JustAid.

Fig. 3. Some of the features provided by the user-interface of JustAid

If the current theory cannot explain a given observation, we want to invent
a new hypothesis such that the current theory along with this new hypothesis
can explain an unexplained observation. Abduction uses a general rule and the
known conclusion to infer a specific fact that might be the cause of the known
conclusion. That means, given a general rule (the current theory), we can use 
abduction to infer a specific fact (an abducible) that might be the cause of the 
known conclusion (unexplained effects). In other words, given a general rule (the 
current theory), we can explain the known conclusion (effects) if we can explain 
(derive) any one of the following: 

• the known conclusion itself, or 

• a possible abducible 

Deduction can be used to infer possible facts given a general rule (the current 
theory) and a specific known fact (cause). All such inferred facts (call them 
deducibles) are true and we can use them while constructing a new hypothesis. 

The aim now is to use these deducibles and abducibles, and construct a 
new hypothesis that can explain the effects or an abducible. We also want to 
represent this new hypothesis as a causal qualitative model and therefore we 
want to construct a new hypothesis that can be represented as a directed causal 
link(s). For the simple example shown in Fig. 4, we can explain effects if we can 
explain any one of the following: 

• (n is increasing) and (p is decreasing) [the known fact]

• (n is increasing) and (f is decreasing) [abducible]

• (d is decreasing) [abducible]

Cause: a is increasing

Effects: n is increasing, p is decreasing

Fig. 4. A simple causal qualitative model, and an unexplained observation

Using deduction we can derive the following fact, 

• (c is decreasing) 

We can now construct a new hypothesis (directed causal link) as shown in 
Fig. 5. The hypothesis can explain the abducible (d is decreasing), given the 
deduced (c is decreasing). 





Fig. 5. New hypothesis 



If we add the causal link shown in Fig. 5 to the current theory, we can explain 
the required effects, given the corresponding cause. Alternatively, we can also 
add other possible links that can explain some other possible abducible. However, 
we may want to select the link shown in Fig. 5 because it will require minimum 
change to the current theory (we need to add only one link). 
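
The interplay of deduction and abduction in this example can be sketched as
follows. The sketch is hedged in two ways. First, the model of Fig. 4 is not
reproduced here, so the link set below is a hypothetical model chosen only to
be consistent with the facts listed above (c decreasing is deducible from the
cause; d decreasing, or n increasing together with f decreasing, would explain
the effects). Second, for brevity only single-node abducibles are searched,
which is a simplification of the technique described in [2] and [3].

```python
# HYPOTHETICAL model consistent with the example: +1 stimulatory, -1 inhibitory.
MODEL = {("a", "c"): -1, ("d", "n"): -1, ("d", "f"): +1, ("f", "p"): +1}

UP, DOWN = +1, -1


def deduce(facts, links):
    """Forward-propagate node states along the signed links."""
    known = dict(facts)
    frontier = list(facts)
    while frontier:
        node = frontier.pop()
        for (src, dst), sign in links.items():
            if src == node and dst not in known:
                known[dst] = known[node] * sign
                frontier.append(dst)
    return known


def abduce(effects, links):
    """Single-node states from which every effect would follow by deduction."""
    nodes = sorted({n for link in links for n in link})
    abducibles = {}
    for node in nodes:
        for state in (UP, DOWN):
            derived = deduce({node: state}, links)
            if all(derived.get(e) == s for e, s in effects.items()):
                abducibles[node] = state
    return abducibles


def propose_links(causes, effects, links):
    """Candidate new links from a deducible to an abducible, with their sign."""
    deducibles = {n: s for n, s in deduce(causes, links).items()
                  if n not in causes}
    abducibles = abduce(effects, links)
    return sorted((src, dst, s_src * s_dst)
                  for src, s_src in deducibles.items()
                  for dst, s_dst in abducibles.items()
                  if src != dst and (src, dst) not in links)


print(propose_links({"a": UP}, {"n": UP, "p": DOWN}, MODEL))
```

With this hypothetical model the only proposal is a stimulatory link from c to
d, which corresponds to the single-link hypothesis of Fig. 5.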



4.1 Learning from Multiple Observations 

The learning program should be able to find such common links that can fully or 
partially explain more than one observation (if possible). Here, partially means 
a link may explain some of the effects of an observation (difference), but not all 
the effects of that observation. We can use the following two indicators to find 
common links that can fully or partially explain more than one observation (if 
possible).

We say the ExplanatoryPower_effect of a link is the number of effects
that can be explained by adding that link to the current model. We can use
this criterion and calculate ExplanatoryPower_effect for every link we can induce
for a given observation.

Let us assume we have n unexplained observations. Let us assume set S_i
contains all the new learned links for observation-i. Let us assume set S_i also
stores the maximum ExplanatoryPower_effect for every link in set S_i. For example,
let us assume there are two possible explanations for observation-i, and we need
to induce the direct link from x to y for both these explanations for observation-i.
Let us say, in the first explanation the ExplanatoryPower_effect of the direct link
from x to y is 1 (it can explain one effect) and in the second explanation it is 3
(it can explain 3 effects). The set S_i needs to store 3 as the
ExplanatoryPower_effect for the direct link from x to y.

We can now merge all the sets S_i and generate a set S_all that contains all the
learned links for all the observations. We can calculate ExplanatoryPower_effect of
a link in the set S_all by adding the values of ExplanatoryPower_effect of that link
in all the sets S_i. Thus, the value of ExplanatoryPower_effect in the set S_all
indicates the number of effects that can be explained by a given link in explaining
all the observations.

Similarly, we can also introduce another indicator called
ExplanatoryPower_observation. The aim here is to count the number of observations
that can be fully or partially explained by a given link.
ExplanatoryPower_observation of a link in the set S_all is equal to the number of
sets S_i that contain that link. We can now present the links in the set S_all to
an expert by ordering them in non-increasing (descending) order on the values of
ExplanatoryPower_effect or ExplanatoryPower_observation. If the links are ordered
on ExplanatoryPower_observation, the output list will have all the common links (if
any) at the top of the list. If the links are ordered on ExplanatoryPower_effect,
the top of the output list will contain links that can explain many effects across
all the observations. Depending on the domain knowledge, an expert can select a
suitable link (or links) from this ordered list and modify the current model. Thus,
JustAid can use multiple observations along with the current theory and propose
alterations that may require minimum changes (not necessarily optimal) to the
current model.
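
The two indicators can be computed as in the following Python sketch. It
follows the definitions above, but the data layout (one dictionary per
explanation, mapping a link to the number of effects it explains) and the
function names are assumptions made for illustration, as are the example links.

```python
from collections import defaultdict


def build_s_i(explanations):
    """Collapse the explanations of one observation into S_i.

    Each explanation maps a link to the number of effects it explains;
    S_i keeps the maximum value seen for each link.
    """
    s_i = {}
    for explanation in explanations:
        for link, power in explanation.items():
            s_i[link] = max(power, s_i.get(link, 0))
    return s_i


def merge(all_s_i):
    """Build S_all: summed effect power and observation count per link."""
    power_effect = defaultdict(int)
    power_observation = defaultdict(int)
    for s_i in all_s_i:
        for link, power in s_i.items():
            power_effect[link] += power
            power_observation[link] += 1
    return dict(power_effect), dict(power_observation)


def ranked(scores):
    """Links in non-increasing order of score (ties broken by link name)."""
    return sorted(scores, key=lambda link: (-scores[link], link))


# Hypothetical example: three unexplained observations.
s_1 = build_s_i([{("x", "y"): 1}, {("x", "y"): 3, ("u", "v"): 1}])
s_2 = build_s_i([{("x", "y"): 2}])
s_3 = build_s_i([{("w", "z"): 6}])

effect, observation = merge([s_1, s_2, s_3])
print(ranked(effect))
print(ranked(observation))
```

In this example the link from x to y is the only one common to two
observations, so it heads the list ordered on ExplanatoryPower_observation,
whereas the link from w to z heads the list ordered on ExplanatoryPower_effect.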



4.2 Additional Biases in JustAid 

JustAid allows an expert to specify explicit biases depending on the type of model 
being reasoned about and the particular experiment involved. The learning al- 
gorithm uses these explicit biases in reducing the search space while looking for 
a suitable hypothesis. After consultations with the domain expert in the area of 
neuroendocrinology, the following explicit biases are implemented in the JustAid 
learning system. 

Focus - an expert can provide a sub-graph (or sub-graphs) in which they are
interested. The source nodes and destination nodes of new learned links should
be part of this sub-graph.

Exclude Sub Models - an expert can provide a sub-graph (or sub-graphs) that
should not be changed. The source nodes and destination nodes of the new learned
links should not be part of this sub-graph.

Only Incoming/Outgoing nodes - an expert can specify a node (or nodes) for which
incoming or outgoing links are not possible.

Impossible Links - an expert can specify a link (or links) that is not possible.
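
One way to picture how these biases prune the search space is as a filter over
candidate links. The Python sketch below is an assumption-laden illustration,
not JustAid's implementation: the Biases class, its field names and the
candidate links are all hypothetical, and the Exclude bias is interpreted here
as rejecting any link with an endpoint inside the excluded sub-graph.

```python
from dataclasses import dataclass, field


@dataclass
class Biases:
    focus: set = field(default_factory=set)        # empty = no restriction
    excluded: set = field(default_factory=set)     # nodes of frozen sub-graphs
    no_incoming: set = field(default_factory=set)  # nodes that accept no new link
    no_outgoing: set = field(default_factory=set)  # nodes that emit no new link
    impossible: set = field(default_factory=set)   # explicit (source, target) pairs


def allowed(link, biases):
    """True if a candidate link (source, target) satisfies every bias."""
    src, dst = link
    if biases.focus and not {src, dst} <= biases.focus:
        return False
    if {src, dst} & biases.excluded:
        return False
    if src in biases.no_outgoing or dst in biases.no_incoming:
        return False
    return link not in biases.impossible


# Hypothetical candidates and biases using node names from Section 2.
candidates = [("Dex", "CORTICO"), ("Dex", "ACTH"),
              ("ColdSwimStress", "CORTICO"), ("CORTICO", "Dex")]
biases = Biases(no_incoming={"Dex", "ColdSwimStress"},
                impossible={("Dex", "CORTICO")})

print([link for link in candidates if allowed(link, biases)])
```

Filtering before ranking keeps the list presented to the expert short; the
ranking itself is unaffected by the biases.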

5 Experimental Results 

In this section, we examine the effectiveness of JustAid in coping with
increasingly incomplete models. Given the model in Fig. 2, we are in a position
to introduce artificial “model incompleteness” by random removal of links. Such
removals may result in a number of observations being unexplained. Note that
even after removing a link (or links), we might still be able to explain a given
observation, provided we can find an alternative explanation path in the reduced
model. That is, it does not follow that a large number of missing links
would necessarily result in a large number of unexplained observations. This is
particularly true if we have many redundant links in the model. As noted by [4],
the model described in Fig. 2 does contain redundant links. Note that a domain
expert may want to keep these redundant links if such relationships do exist.
This point notwithstanding, in the following section we discuss the model
construction ability of JustAid for different numbers of unexplained observations.

5.1 Experiment Set Up 

The model in [6] and Fig. 2 was constructed in JustAid. We then tested the
completeness of the model against observations (differences). We take this model
and the observations that it can explain. The aim after this is to remove links
(randomly) from the model so that it is no longer possible to explain all the
observations that were previously explained by the original model. This provides
an incomplete model, the corresponding unexplained observations (differences), and
a set of deleted links. JustAid can then take an incomplete model and unexplained
observations as input and propose possible new links.

JustAid expects experts’ input in selecting links. JustAid may propose many 
possible new links for a given incomplete model. However, some of these links 
may not be suitable (or feasible) if one considers the underlying biological
system. If we select a new link (or links) based only on its ability to explain
unexplained observations, we may construct a new model that is useless from the
point of view of the neuroendocrine domain, so the involvement of an expert is
critical. The aim is to select previously deleted links from the suggested links.
The rationale is that the expert had included these links in the original model
and therefore we can consider them as suitable links from the point of view of the
neuroendocrine domain. Note that the aim in this research is not to replace an
expert with an automated learner but to support them. This is because there may
not be enough information in the model and data for the learner to propose new
models that are appropriate for the domain. In contrast, an expert knows the
domain and can use additional domain knowledge (not available to the learner) in
selecting a suitable new model.

In the first experiment, we removed (randomly) 11 links from the model so 
that it cannot explain all the observations (differences). We created 1000 such 
different models in which we randomly removed 11 links. For each of these models 
and corresponding unexplained observations (differences), we used JustAid to
invent new links. The overall framework for these experiments is described below. 

1. Randomly delete 11 links from the original model. 

2. Using JustAid, find the number of unexplained observations (differences) and 
generate a list of new learned links (ordered on ExplanatoryPower_observation)
that can be added to fully or partially explain one or more unexplained 
observations. 

3. If the list contains one of the deleted links, that link is added to the model. 
If there is more than one deleted link in the list, the deleted link with the 
highest explanatory power is selected and added to the model. If there are 
no deleted links in the list, exit the experiment. 

4. Using JustAid, find the number of unexplained observations (differences) 
after adding the above learned link, and calculate the number of unexplained 
observations that can be explained by adding the above learned link. 

To compare the performance of JustAid against a random selection of a link
from the deleted links, we modified the 2nd and 3rd steps in the above setup and
repeated the above experiments by randomly selecting a deleted (suitable) link.
Similarly, we repeated the above experiments by selecting a link from the highest
ranked links (highest ExplanatoryPower_observation) by JustAid, irrespective of
whether or not it was previously deleted.



5.2 Results 

For models where 11 links are removed (randomly) from the original model, Fig. 6
shows the average number of unexplained observations explained by adding one
learned link selected from the suitable (deleted) links using JustAid’s
suggestions (ranking). It also shows the average number of unexplained observations
explained by adding one link selected randomly from the suitable (deleted) links.
The error bars in the figure represent the standard error of the mean (S.E.M.)¹
for each number of unexplained observations.

Note that, for the above two criteria, we select a link that was a part of the 
original model that explained all the observations. Thus, it is natural to expect 
that the selected link would explain some unexplained observations. However, as 
shown in Fig. 6, a link selected from the suitable (deleted) links using JustAid’s 
suggestions could explain many more unexplained observations (differences) than 
a link randomly selected from the suitable (deleted) links. 

¹ S.E.M. = (standard deviation) / sqrt(sample size)

[Figure 6 omitted. Horizontal axis: number of unexplained observations after
randomly removing 11 links from the model.]

Fig. 6. Average number of unexplained observations explained by adding one link



Fig. 6 also shows that as the number of unexplained observations increases,
the number of these unexplained observations explained by adding a single link,
selected from the suitable links using JustAid’s suggestions, also increases.
JustAid tries to find a common link (or links) that can explain as many
observations as possible. That means that if the number of unexplained observations
increases, the chance of finding a common link that can explain more unexplained
observations also increases.

If we select a link based only on JustAid’s ranking (irrespective of whether
it was deleted or not), we can explain more unexplained observations than with
the other two criteria.

Note that, we used the same set of random numbers for all the three ex- 
periments described above. For example, after randomly removing 11 links, we 
carried out all the three experiments for the reduced model. 

While the above results confirm our expectations about a single link proposed
by JustAid, interest clearly remains in the complete model (that is, one
that can explain all the observations). The overall framework for these
experiments is described below:

1. Randomly delete X (where X ∈ {5, 15}) links from the original model.

2. Using JustAid, generate a list of new learned links (ordered on
ExplanatoryPower_observation) that can be added to fully or partially explain
one or more unexplained observations.

3. If the list contains one of the deleted links, that link is added to the model. 
If there is more than one deleted link in the list, the deleted link with the 
highest explanatory power is selected and added to the model. If there are 
no deleted links in the list, exit the experiment.

4. Repeat steps 2 and 3 until the model explains all the observations.

Again, to compare the performance of JustAid against a random selection of
a link from the deleted links, we modified the 2nd and 3rd steps in the above setup
and repeated the above experiments. Similarly, we repeated the above experiments
by selecting a link from the highest ranked links (highest
ExplanatoryPower_observation) by JustAid, irrespective of whether or not it was
previously deleted. The results are shown in Table 2.
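
The four-step framework above can be written as a short loop. The Python sketch
below shows the control flow only: the functions propose and count_unexplained
stand in for JustAid's learner and completeness check and are hypothetical, as
is the toy model, in which three of the five links are assumed to be needed to
explain the observations and two are redundant.

```python
import random


def complete_model(original, n_delete, propose, count_unexplained, rng):
    """Delete links at random, then re-add suggested deleted links (steps 1-4)."""
    deleted = set(rng.sample(sorted(original), n_delete))       # step 1
    model = set(original) - deleted
    added = []
    while count_unexplained(model) > 0:                         # step 4
        suggestions = propose(model)                            # step 2, best first
        choice = next((link for link in suggestions
                       if link in deleted and link not in model), None)
        if choice is None:                                      # step 3: give up
            break
        model.add(choice)                                       # step 3: add link
        added.append(choice)
    return added, model


# Toy stand-ins (assumptions of this sketch, not JustAid itself).
ORIGINAL = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")}
REQUIRED = {("a", "b"), ("c", "d"), ("d", "e")}


def count_unexplained(model):
    return len(REQUIRED - model)


def propose(model):
    return sorted(REQUIRED - model)


added, model = complete_model(ORIGINAL, 3, propose, count_unexplained,
                              random.Random(0))
print(len(added), "links re-added after deleting 3")
```

Because redundant links never need to be restored, the number of links added
can be smaller than the number removed, which is the effect reported in Table 2.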



Table 2. Average number of new links added to explain all the observations, for
different model sizes

                 Average number of new links added to explain all the observations

No of links      Random selection     Selection of a deleted    Selection of a link
removed from     of a deleted link    link with highest         (deleted or not) with
the model                             explanatory power         highest explanatory power

5                3.99 ± 0.12          1.8 ± 0.081               1.63 ± 0.08
15               12.78 ± 0.23         4.6 ± 0.197               4.02 ± 0.158



Table 2 shows that the model does have redundant links and that JustAid can
be used to remove them (if required). It should be noted that indeterminacy in
qualitative reasoning produces many possible explanations for a given set
of effects. Therefore it might be possible to remove a few links and still explain
all the effects. However, if an expert is satisfied with all the links in the model,
they may choose not to remove any links. It should be noted that although links may
not be required for this particular set of observations, the model will generally
reflect a range of other well established views about the system (world). The
redundancy in biological control systems may also contribute to some of these
redundant links. Thus, the issue of removing redundant links only arises when
an expert does not strongly believe in the entire hypothesis and may want to
explore a few alternative hypotheses.

We have also empirically evaluated JustAid when used by the domain expert
to revise a real world neuroendocrine model described in [6]. The domain expert
found the interaction with JustAid very friendly and stimulating. We will report
the results of this study in future publications.

6 Conclusions 

We have described a new framework and a tool, called JustAid, that can assist 
researchers in checking that their scientific models can explain available data 
and that can make useful suggestions to researchers about how to improve their 
models. Even with the very simple causal reasoning tools described here applied 
to real test cases, it has been possible to point out to a researcher problems
with their models and make useful suggestions as to how they can be improved.
This has been possible because a researcher is normally focussed on a specific
hypothesis, whereas a computer program will search through all the material
available. For the researcher, the experience of interacting with such a system
is not so much like interacting with a schoolmaster correcting mistakes as
interacting with a lateral thinker suggesting other things that need to be taken
into account. The researcher sees the program not as pointing out errors but
as checking ramifications of the model that they would not normally consider.
The researcher found this interaction stimulating. If this sort of result can be
found with the simple prototype we have described here, we anticipate that as
these sorts of tools become more developed they will have a central place in the
experimental researcher’s toolkit.


Abstract. Deduction, induction, and learning are various aspects of a more
general scientific activity: the discovery of truth. We propose to em- 
bed them in a common, logical framework. First, we define a general- 
ized notion of “logical consequence.” Alternating compact and “weakly 
compact” consequences, we stratify the set of generalized logical conse- 
quences of a given theory in a hierarchy. Classical first-order logic is a 
particular case of this framework; the fact that it is all about deduction 
is due to the compactness theorem, and this is reflected by the collaps- 
ing of the corresponding hierarchy to the first level. Classical learning 
paradigms in the inductive inference literature provide other particular 
cases. Finite learning corresponds exactly to the first level (or level $\Sigma_1$)
of the hierarchy, whereas learning in the limit corresponds to another
level (namely $\Sigma_2$). More generally, strong and natural connections exist
between our hierarchy of generalized logical consequences, the Borel hi- 
erarchy, and the hierarchy which measures the complexity of a formula 
in terms of alternations of quantifiers. It is hoped that this framework 
provides the foundation of a unified logic of deduction and induction, and 
highlights the inductive nature of learning. An essential motivation for 
our work is to apply the theory presented here to the design of “Inductive 
Prolog”, a system with both deductive and inductive capabilities, based
on a natural extension of the resolution principle. 



1 Introduction 

Let us first make a few remarks about the nature of deduction and the nature of 
induction, before we turn to the nature of learning. If a formula $\varphi$ is a
deductive consequence of a set of formulas T, it is clear to anyone that $\varphi$ is
a logical consequence of T, in the sense that $\varphi$ is true in every model of T.
Many would also agree that we can substitute “inductive” for “deductive” in the
previous sentence. What is then the difference between $\varphi$ being a deductive and
$\varphi$ being an inductive consequence of T? Well, it should be possible to discover
with certainty that $\varphi$ is a deductive consequence of T, if this is indeed the
case, whereas it should not be possible to discover with certainty that $\varphi$ is an
inductive consequence of T, if $\varphi$ is not in fact a deductive consequence of T.
How can we discover with certainty that $\varphi$ is true on the basis of T? A natural
answer is: if and only if $\varphi$ is actually a logical consequence of a finite subset
of T. In other words, if and only if $\varphi$ is a compact logical consequence of T. On
the other hand, if $\varphi$ is an inductive, but not a deductive, consequence of T, then
we need an infinite part of T, if not the whole of T, in order to be able to
establish this fact. At this point, two questions emerge:

1. What should count as a model of T? If it is any structure (as defined in 
classical logic) in which every member of T is true, and if T consists of 
first-order formulas, then the compactness theorem shows that every logical 
consequence of T is actually a deductive consequence of T, and there is no 
scope for a proper notion of induction. Hence, we should be able to consider 
not all structures, but some of them. This would result in a generalized 
notion of logical consequence that might not be compact. Are there natural 
candidates for such sets of structures? 

2. Suppose that the class of models of T that have been retained is such that
some generalized logical consequence $\varphi$ of T is not a deductive (compact)
consequence of T. Is then $\varphi$ automatically “promoted” to the status of
inductive consequence of T? The fact that every model of T is a model of $\varphi$
involves infinitely many members of T. But how difficult is it to conclude
that $\varphi$ is true on the basis of T? If we can define difficulty levels, should
one of them be considered as “the inductive level”?

Let us now consider learning (for more details on the notions mentioned below, 
see [7]). We claim that the classical paradigms in the inductive inference
literature are also about discovering the truth. Suppose that the underlying logical
vocabulary consists of a unary predicate symbol P, together with a constant
$\overline{n}$ for each natural number n. A language L can be identified with
$T = \{P\overline{n} \mid n \in L\}$, and its complement with
$T' = \{\neg P\overline{n} \mid n \in \mathbb{N} \setminus L\}$. A text (respectively,
informant) for L can then be identified with an enumeration of $T$ (respectively,
$T \cup T'$). The task of discovering an r.e. index for (respectively, the
characteristic function of) L can be identified with the task of discovering the
infinitary formula $\bigwedge T$ (respectively, $\bigwedge T \wedge \bigwedge T'$).
Clearly $\bigwedge T$ is a logical consequence of T in the classical sense,
hence also in any more general sense. Retaining only the structures that correspond
to languages, i.e., the intended possible realities, will make
$\bigwedge T \wedge \bigwedge T'$ a generalized consequence of T. So in both cases,
identification in the limit is about discovering a particular generalized logical
consequence of T, namely a formula that can be viewed as a description of the
language to be learned. On the other hand, the task of discovering the truth of an
arbitrary formula $\varphi$ from a background theory is equivalent to partial
classification (see [4]): a formula $\varphi$ represents the class C of all theories T
which logically imply $\varphi$ in a general sense, and the partial classifier has to
find out, on the basis of data from background theory T, that $\varphi$ is a
generalized logical consequence of T, whenever this is true.
Note the following: 

1. Considering the infinite formula
$P\overline{n_0} \wedge P\overline{n_1} \wedge P\overline{n_2} \wedge \ldots$ rather
than the index of a Turing machine which generates $n_0, n_1, n_2, \ldots$ provides
a logical representation equivalent to a representation in terms of r.e. indexes.
But infinite formulas are not only a technical way of embedding learning paradigms
into a logical framework. It turns out that the extensions of first-order languages
to countable fragments of $L_{\omega_1\omega}$ (see below) are the natural logical
languages of our framework.

2. When learning from positive data only, there is an implicit assumption that
all positive data are enumerated in a text for a language L. This means that
the models of $\{P\overline{n} \mid n \in L\}$ to be considered should not be
consistent with $P\overline{n}$ for any $n \in \mathbb{N} \setminus L$. The notion of
generalized logical consequence, together with the right set of structures, should
be able to accommodate this kind of property.

The previous considerations go beyond epistemological concerns about the nature
of induction or learning. Indeed, the aim of this work is also to investigate
the foundations of induction in AI. Current work in Inductive Logic Programming
(see [14]) focuses on a very specific inductive task: discovering the minimal
model of a potentially infinite set of data. We take a more general view and
investigate general inductive abilities. Considering both deduction and induction
as particular expressions of the art of discovering the truth opens the door to
a unified framework which can provide the basis of an “Inductive Prolog”. If
Prolog is the deductive engine of AI, giving an agent the ability to compute
solutions to existential queries, Inductive Prolog should be the
deductive-inductive engine of AI, giving an agent the ability to compute
solutions to existential or $\Sigma_2$ queries, such as: does there exist a
chemical compound which has this effect on all molecules having this and that
property?

We proceed as follows. In Section 2, we introduce the necessary notation 
and in Section 3, we describe the components of our framework. In Section 4, we 
define hierarchies of generalized logical consequences by alternating the use of
compactness and weak compactness, and show some of their properties relevant 
to learning paradigms. In Section 5, we investigate the relationship between 
the hierarchies of generalized logical consequences and formula complexity. As 
additional evidence of their naturalness we also demonstrate links with the Borel 
hierarchy. Finally, in Section 6, we show how a number of classical learning 
paradigms can be cast into our framework. 



2 Notation 

A vocabulary is a countable set of function symbols (possibly including constants) 
and predicate symbols. A vocabulary can, but does not have to, contain equality. 
If it does not, it is said to be equality free. From now on, S denotes an arbitrary 
countable vocabulary. For some results, assumptions on S will be made. We 
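denote by $L^{S}_{\omega\omega}$ the set of all first-order $S$-formulas, and by
$L^{S}_{\omega_1\omega}$ the extension of $L^{S}_{\omega\omega}$ that accepts
countable nonempty conjunctions and disjunctions.^1 So for all countable
nonempty $T \subseteq L^{S}_{\omega_1\omega}$, the disjunction of all members of
$T$, written $\bigvee T$, and the conjunction of all members of $T$, written
$\bigwedge T$, both belong to $L^{S}_{\omega_1\omega}$. Note that the occurrence
or nonoccurrence of $=$ in $S$ determines whether $L^{S}_{\omega\omega}$ and
$L^{S}_{\omega_1\omega}$ are languages with or without equality. A countable
fragment of $L^{S}_{\omega_1\omega}$ is a countable subset $\mathcal{L}$ of
$L^{S}_{\omega_1\omega}$ which contains $L^{S}_{\omega\omega}$ and is closed
under subformulas, boolean operators, and quantification.^2 From now on,
$\mathcal{L}$ denotes a countable fragment of $L^{S}_{\omega_1\omega}$. It
represents the language on the basis of which the core of the theory is
developed. Clearly, $L^{S}_{\omega\omega}$ is the smallest countable fragment of
$L^{S}_{\omega_1\omega}$. The members of $L^{S}_{\omega_1\omega}$ which are in
$\Sigma_0$ or $\Pi_0$ prenex form are the quantifier free members of
$L^{S}_{\omega\omega}$. Let nonnull ordinal $\alpha$ and
$\varphi \in L^{S}_{\omega_1\omega}$ be given. We say that $\varphi$ is in
$\Sigma_\alpha$ (respect. $\Pi_\alpha$) prenex form just in case one of the
following holds:

1. $\varphi$ is in $\Sigma_\beta$ or $\Pi_\beta$ prenex form for some
$\beta < \alpha$, or

2. $\varphi$ is of the form $\exists x \psi$ (respect. $\forall x \psi$) for
some $\psi \in L^{S}_{\omega_1\omega}$ which is in $\Sigma_\alpha$ (respect.
$\Pi_\alpha$) prenex form, or

3. $\varphi$ is of the form $\bigvee X$ (respect. $\bigwedge X$) for some
(countable) $X \subseteq L^{S}_{\omega_1\omega}$ all of whose members are in
$\Sigma_\alpha$ (respect. $\Pi_\alpha$) prenex form.

It is easy to verify that every member of $L^{S}_{\omega_1\omega}$ is logically
equivalent to a member of $L^{S}_{\omega_1\omega}$ which is in $\Sigma_\alpha$
prenex form for some $\alpha$. If $\varphi \in L^{S}_{\omega_1\omega}$ is
logically equivalent to a closed member of $L^{S}_{\omega_1\omega}$ which is in
$\Sigma_\alpha$ (respect. $\Pi_\alpha$) prenex form, then we say that $\varphi$
is $\Sigma_\alpha$ (respect. $\Pi_\alpha$). Note that the classical definition
of a member of $L^{S}_{\omega\omega}$ being $\Sigma_n$ (respect. $\Pi_n$) for
some $n \in \mathbb{N}$ is a particular case of the former.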

The $\sim$ operator is the function which is defined as follows. If
$\varphi \in L^{S}_{\omega_1\omega}$ is of form $\neg\psi$ for some
$\psi \in L^{S}_{\omega_1\omega}$ then $\sim(\varphi) = \psi$; otherwise
$\sim(\varphi) = \neg\varphi$. Given $\Gamma \subseteq L^{S}_{\omega_1\omega}$
and $S$-structure $\mathfrak{M}$, the $\Gamma$-diagram of $\mathfrak{M}$,
denoted $D_\Gamma(\mathfrak{M})$, is the set of all members of $\Gamma$ that are
true in $\mathfrak{M}$. Terms will refer to $S$-terms, formulas to members of
$\mathcal{L}$ (not $L^{S}_{\omega_1\omega}$), sentences to closed formulas, and
structures to $S$-structures. A Henkin structure is a structure all of whose
individuals interpret closed terms.^3 A Herbrand structure is a structure each
of whose individuals interprets a unique closed term.^4 Hence Herbrand
structures are Henkin. When we consider a Henkin or a Herbrand structure, or a
nonempty class of Henkin or Herbrand structures, we tacitly assume that $S$
contains at least one constant.

^1 Given regular cardinal $\kappa$, $L^{S}_{\kappa\omega}$ denotes the set of
all $S$-formulas built from atomic $S$-formulas using boolean operators,
quantifiers, and disjunctions or conjunctions over nonempty sets of cardinality
smaller than $\kappa$. See [10].

^2 For more details on this definition, see [2] or [12].

^3 Henkin structures should not be confused with Henkin models, which is the
name often given to the general models defined in [6]. Our notion of Henkin
structure is closer to the canonical structures defined in [17] for Henkin’s
proof of the completeness of first-order logic.

^4 Herbrand structures are close to the Herbrand models considered in Logic
Programming. See [3] or [11].

3 Components of the Theory

We denote by $W$ a class of structures, the class of possible worlds. Classical
first-order logic would take for $W$ the class of all structures. We have
explained that in order to address questions such as deduction versus induction,
we need to be free to choose a more restrictive class of possible worlds. The
discussion about learning suggests taking $W$ to be the class of all Henkin
structures, or the class of all Herbrand structures. Henkin and Herbrand
structures are interesting in many respects, and play a prominent role in Logic
Programming ([11]). Given $T \subseteq \mathcal{L}$, we denote by
$\mathrm{Mod}_W(T)$ the class of all members of $W$ that are models of $T$.

We denote by $\mathcal{O}$ a set of sentences, that we call the class of
possible observations. For classical first-order logic, the choice of
$\mathcal{O}$ would be irrelevant. Suppose we want to cast learning paradigms
into this framework. For learning from positive data only, $\mathcal{O}$ will be
equal to the set of all atomic sentences; for learning from both positive and
negative data, $\mathcal{O}$ will be equal to the set of all basic sentences.
Other examples can also be found in the literature (see for example [9]).

We denote by $\mathcal{T}$ a set of sets of sentences, that we call the class of
possible theories. This corresponds roughly to the class of possible texts in
the inductive inference literature. Classical first-order logic would take for
$\mathcal{T}$ the set of all sets of closed members of $L^{S}_{\omega\omega}$.

The quintuple $(S, \mathcal{L}, W, \mathcal{O}, \mathcal{T})$ contains all we
need to define the fundamental concepts of this framework. We call this
quintuple the paradigm under investigation, and we denote it by $\mathcal{P}$.

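Definition 1. Let $T \subseteq \mathcal{L}$ and $\mathfrak{M} \in W$ be given.
We say that $\mathfrak{M}$ is an $\mathcal{O}$-minimal model of $T$ in $W$ iff
$\mathfrak{M} \in \mathrm{Mod}_W(T)$ and for all
$\mathfrak{N} \in \mathrm{Mod}_W(T)$,

$\{\varphi \in \mathcal{O} \mid \mathfrak{N} \models \varphi\} \not\subset \{\varphi \in \mathcal{O} \mid \mathfrak{M} \models \varphi\}$,

where $\subset$ denotes strict inclusion.

The discussion above about learning should justify the previous definition. A
similar notion is also encountered in AI in the form of the closed-world
assumption defined in [16] (for an overview see [5]), and of course in Logic
Programming with the least Herbrand models (see [3,11]). Let
$T \subseteq \mathcal{L}$ be given. Then $T$ can have exactly one
$\mathcal{O}$-minimal model in $W$, or none, or many. We denote by
$\mathrm{Mod}^{\mathcal{O}}_W(T)$ the class of all $\mathcal{O}$-minimal models
of $T$ in $W$. Note the following:

Lemma 2. If $\mathcal{O}$ is closed under $\sim$ then for all
$T \subseteq \mathcal{L}$, $\mathrm{Mod}^{\mathcal{O}}_W(T) = \mathrm{Mod}_W(T)$.

We can now generalize the notion of logical consequence: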

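Definition 3. Let $T \subseteq \mathcal{L}$ and $\varphi \in \mathcal{L}$ be
given. We say that $\varphi$ is a logical consequence of $T$ in $W$, and we
write $T \models_W \varphi$, iff every member of $\mathrm{Mod}_W(T)$ is a model
of $\varphi$. We say that $\varphi$ is an $\mathcal{O}$-minimal logical
consequence of $T$ in $W$, and we write $T \models^{\mathcal{O}}_W \varphi$, iff
every member of $\mathrm{Mod}^{\mathcal{O}}_W(T)$ is a model of $\varphi$.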
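
The notions of minimal model and minimal logical consequence can be tried out on a small propositional toy paradigm. The following Python sketch is only an illustration under assumptions that are not part of the text: worlds are finite sets of true atoms, sentences are restricted to literals, and every name in it (`holds`, `diagram`, `minimal_models`, and so on) is hypothetical. It compares observations made of positive data only with observations closed under negation.

```python
from itertools import combinations

ATOMS = ("P0", "P1", "P2")

# Hypothetical toy class W of possible worlds: every set of true atoms.
WORLDS = [frozenset(c) for r in range(len(ATOMS) + 1)
          for c in combinations(ATOMS, r)]

# A literal is (atom, True) for the atom itself, (atom, False) for its negation.
O_POSITIVE = [(a, True) for a in ATOMS]                       # positive data only
O_BASIC = [(a, pol) for a in ATOMS for pol in (True, False)]  # closed under ~


def holds(world, literal):
    atom, positive = literal
    return (atom in world) == positive


def diagram(world, observations):
    """O-diagram of a world: the observations that are true in it."""
    return frozenset(o for o in observations if holds(world, o))


def models(theory):
    """Possible worlds satisfying every member of the theory."""
    return [w for w in WORLDS if all(holds(w, s) for s in theory)]


def minimal_models(theory, observations):
    """Models whose O-diagram is not strictly above another model's O-diagram."""
    mods = models(theory)
    return [m for m in mods
            if not any(diagram(n, observations) < diagram(m, observations)
                       for n in mods)]


def consequence(theory, sentence):
    return all(holds(w, sentence) for w in models(theory))


def minimal_consequence(theory, sentence, observations):
    return all(holds(w, sentence)
               for w in minimal_models(theory, observations))


T = [("P0", True)]
print(minimal_models(T, O_POSITIVE))       # only the world {P0}
print(len(minimal_models(T, O_BASIC)))     # 4: every model is minimal
print(consequence(T, ("P1", False)))       # False
print(minimal_consequence(T, ("P1", False), O_POSITIVE))  # True
```

With positive observations only, the negation of P1 is a minimal logical consequence of the theory although it is not a classical one, which is the closed-world behaviour mentioned above. With observations closed under negation, all four models are minimal, in line with Lemma 2.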

The notion of $\mathcal{O}$-minimal logical consequence in $W$ is the notion of
generalized logical consequence we investigate; the other just proves useful.
Although we develop the theory on a very broad basis, here we consider almost
exclusively two cases of paradigms, that we now define.

Definition 4. We say that $\mathcal{P}$ is standard iff
$\mathcal{T} = \{D_{\mathcal{O}}(\mathfrak{M}) \mid \mathfrak{M} \in W\}$. If
$\mathcal{P}$ is standard and for all $T \in \mathcal{T}$ and sentences
$\varphi$, either $T \models^{\mathcal{O}}_W \varphi$ or
$T \models^{\mathcal{O}}_W \neg\varphi$, then we say that $\mathcal{P}$ is
ideal.

Standard paradigms are the analogues of the classical paradigms in the inductive
inference literature. When no data are missing, the latter even correspond to
ideal paradigms.

4 The Hierarchies of Generalized Logical Consequences 

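We now define the hierarchies of generalized logical consequences that are
basically the fundamental object of study of this framework.^5 First we set, for
all $T \in \mathcal{T}$,
$\Sigma^{\mathcal{P}}_0(T) = \Pi^{\mathcal{P}}_0(T) = T$.

Definition 5. Let nonnull ordinal $\alpha$ and $T \in \mathcal{T}$ be given.
Suppose that $\Pi^{\mathcal{P}}_\beta(T)$ has been defined for all
$\beta < \alpha$. A sentence $\varphi$ belongs to $\Sigma^{\mathcal{P}}_\alpha(T)$
iff there exists finite $E \subseteq T$ and finite
$H \subseteq \bigcup_{\beta<\alpha} \Pi^{\mathcal{P}}_\beta(T)$ such that for all
$T' \in \mathcal{T}$:

if $E \subseteq T'$ and $T' \models^{\mathcal{O}}_W H$ then
$T' \models^{\mathcal{O}}_W \varphi$.

Definition 6. Let nonnull ordinal $\alpha$ and $T \in \mathcal{T}$ be given.
Suppose that $\Sigma^{\mathcal{P}}_\alpha(T')$ has been defined for all
$T' \in \mathcal{T}$. A sentence $\varphi$ belongs to
$\Pi^{\mathcal{P}}_\alpha(T)$ iff $T \models^{\mathcal{O}}_W \varphi$ and there
exists finite $E \subseteq T$ and finite
$H \subseteq \Sigma^{\mathcal{P}}_\alpha(T)$ such that for all
$T' \in \mathcal{T}$:

if $E \subseteq T'$, $T' \models^{\mathcal{O}}_W H$, and
$T' \not\models^{\mathcal{O}}_W \varphi$ then
$\neg\varphi \in \Sigma^{\mathcal{P}}_\alpha(T')$.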

The description of the $\Sigma_\alpha$ level is based on a compactness property:
a finite piece of information, consisting of evidence (a subset of the theory)
and hypotheses (sentences that have been “discovered” before), makes it possible
to conclude that $\varphi$ is an $\mathcal{O}$-minimal logical consequence of
$T$ in $W$. The description of the $\Pi_\alpha$ level is based on a property of
weak compactness: a finite piece of information of the same kind as before makes
it possible to conclude that $\varphi$ could not belong to the set of sentences
that are not $\mathcal{O}$-minimal logical consequences of $T$ in $W$ without
this fact being already discovered. The compactness property makes it possible
to conclude with certainty that $\varphi$ is true. The property of weak
compactness makes it possible to believe confidently in $\varphi$. Of course,
the certain conclusion and the confident belief in question are not absolute:
they are relative to that part of the hierarchy below the level which is
currently built. Only the sentences in $\Sigma^{\mathcal{P}}_1(T)$ can be
discovered to be true with absolute certainty, and only the sentences in
$\Pi^{\mathcal{P}}_1(T)$ can be believed to be true with absolute confidence, as
far as $T$ is reliable, for instance, in the case of standard paradigms, as far
as $T$ really contains all possible observations true in the underlying world.
Note the relationship between weak compactness and the notion of refutability
defined by Popper ([15]).

There are many equivalents to Definitions 5 and 6, some of them simpler, 
particularly when 7 is standard or ideal. For instance, it is easy to verify the 
following, to be used in the sequel. 

^5 Since members of $L^{S}_{\omega_1\omega}$ can contain infinitely many free
variables, and due to the use of negation, it is simpler to accept only
sentences in the hierarchies.
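
Lemma 7. Suppose that $\mathcal{P}$ is standard. Let $T \in \mathcal{T}$ be
given. A sentence $\varphi$ belongs to $\Sigma^{\mathcal{P}}_1(T)$ iff there
exists finite $E \subseteq T$ such that $E \models_W \varphi$.

Lemma 8. Suppose that $\mathcal{P}$ is standard. Let $\alpha \neq 0$ and $T$ in
$\mathcal{T}$ be given. A sentence $\varphi$ belongs to
$\Pi^{\mathcal{P}}_\alpha(T)$ iff $T \models^{\mathcal{O}}_W \varphi$ and there
is $\psi \in \Sigma^{\mathcal{P}}_\alpha(T)$ such that for all
$T' \in \mathcal{T}$, if $T' \models^{\mathcal{O}}_W \psi$ and
$T' \not\models^{\mathcal{O}}_W \varphi$, then
$\neg\varphi \in \Sigma^{\mathcal{P}}_\alpha(T')$.

Lemma 9. Suppose that $\mathcal{P}$ is ideal. Let $\alpha > 1$ and
$T \in \mathcal{T}$ be given. A sentence $\varphi$ belongs to
$\Sigma^{\mathcal{P}}_\alpha(T)$ iff there is
$\psi \in \bigcup_{\beta<\alpha} \Pi^{\mathcal{P}}_\beta(T)$ such that
$\psi \models_W \varphi$.

Also note the following closure property of the $\Pi_\alpha$ levels of the
hierarchies:

Lemma 10. For all $\alpha \neq 0$ and $T \in \mathcal{T}$,
$\Pi^{\mathcal{P}}_\alpha(T)$ is closed under (finite) disjunction and (finite)
conjunction.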

When we expand the set of possible observations, we cannot lose any generalized 
logical consequence on any level of the hierarchy, assuming that we are dealing 
with ideal paradigms. Recall that in standard paradigms, the set of possible 
theories is uniquely determined by the set of possible worlds and the set of 
possible observations, and that ideal paradigms are standard. 

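Proposition 11. Suppose that $\mathcal{P}$ is ideal. Let $\mathcal{P}'$ be a
standard paradigm of the form $(S, \mathcal{L}, W, \mathcal{O}', \mathcal{T}')$
with $\mathcal{O}$ included in $\mathcal{O}'$. Then for all ordinals $\alpha$
and $\mathfrak{M} \in W$,
$\Sigma^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{M})) \subseteq \Sigma^{\mathcal{P}'}_\alpha(D_{\mathcal{O}'}(\mathfrak{M}))$
and
$\Pi^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{M})) \subseteq \Pi^{\mathcal{P}'}_\alpha(D_{\mathcal{O}'}(\mathfrak{M}))$.

Proof. Proof is by induction. Since $\mathcal{O} \subseteq \mathcal{O}'$,
$D_{\mathcal{O}}(\mathfrak{M}) \subseteq D_{\mathcal{O}'}(\mathfrak{M})$ for all
$\mathfrak{M} \in W$. Let $\alpha \neq 0$ be given, and suppose that
$\Pi^{\mathcal{P}}_\beta(D_{\mathcal{O}}(\mathfrak{M})) \subseteq \Pi^{\mathcal{P}'}_\beta(D_{\mathcal{O}'}(\mathfrak{M}))$
for all $\beta < \alpha$ and $\mathfrak{M} \in W$. Definition 5 implies
immediately that for all $\mathfrak{M} \in W$,
$\Sigma^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{M})) \subseteq \Sigma^{\mathcal{P}'}_\alpha(D_{\mathcal{O}'}(\mathfrak{M}))$.
Let $\mathfrak{M} \in W$ and
$\varphi \in \Pi^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{M}))$ be given.
By Lemma 8, choose
$\psi \in \Sigma^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{M}))$ such that
for all $T \in \mathcal{T}$ with $T \models^{\mathcal{O}}_W \psi$ and
$T \not\models^{\mathcal{O}}_W \varphi$,
$\neg\varphi \in \Sigma^{\mathcal{P}}_\alpha(T)$. Let $\mathfrak{N} \in W$ with
$D_{\mathcal{O}'}(\mathfrak{N}) \models^{\mathcal{O}'}_W \psi$ and
$D_{\mathcal{O}'}(\mathfrak{N}) \not\models^{\mathcal{O}'}_W \varphi$ be given.
Since $\mathcal{P}$ is ideal and $\mathcal{O} \subseteq \mathcal{O}'$,
$\mathcal{P}'$ is ideal as well, and we infer that
$D_{\mathcal{O}}(\mathfrak{N}) \models^{\mathcal{O}}_W \psi$ and
$D_{\mathcal{O}}(\mathfrak{N}) \not\models^{\mathcal{O}}_W \varphi$. Hence
$\neg\varphi \in \Sigma^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{N}))$
which, we have seen, implies that
$\neg\varphi \in \Sigma^{\mathcal{P}'}_\alpha(D_{\mathcal{O}'}(\mathfrak{N}))$.
Since the same part of the proof implies that
$\psi \in \Sigma^{\mathcal{P}'}_\alpha(D_{\mathcal{O}'}(\mathfrak{M}))$, we
conclude with Lemma 8 that
$\varphi \in \Pi^{\mathcal{P}'}_\alpha(D_{\mathcal{O}'}(\mathfrak{M}))$.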

When learning from just positive data versus learning from both positive and 
negative data, the corresponding paradigms are still more strongly related. First 
we need a definition. 

Definition 12. Let ordinal $\alpha$ and $\varphi \in \mathcal{L}$ be given. We
say that $\varphi$ is $\Sigma_\alpha$ in $\mathcal{P}$ just in case for all
$T \in \mathcal{T}$, if $T \models^{\mathcal{O}}_W \varphi$ then
$\varphi \in \Sigma^{\mathcal{P}}_\alpha(T)$. We say that $\varphi$ is
$\Pi_\alpha$ in $\mathcal{P}$ just in case for all $T \in \mathcal{T}$, if
$T \models^{\mathcal{O}}_W \varphi$ then
$\varphi \in \Pi^{\mathcal{P}}_\alpha(T)$.

Trivially, all sentences which are logically equivalent to the negation of a
member of $\mathcal{O}$ are $\Pi_1$ in $\mathcal{P}$. We can then apply the
following to $\mathcal{O}$ equal to the set of positive data only, and
$\mathcal{O}'$ equal to the set of both positive and negative data, to see how
formulas that are generalized logical consequences of a given theory $T$ can be
raised from some level in the hierarchy over $T$ defined from $\mathcal{O}'$ to
some level above in the hierarchy over $T$ defined from $\mathcal{O}$.
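
Proposition 13. Let paradigm $\mathcal{P}'$ be ideal and of the form
$(S, \mathcal{L}, W, \mathcal{O}', \mathcal{T}')$ with $\mathcal{O}'$ closed
under $\sim$. Let ordinal $\alpha$ be given, and suppose that all members of
$\mathcal{O}'$ are $\Pi_\alpha$ in $\mathcal{P}$. For all $\mathfrak{M} \in W$:

1. $\Sigma^{\mathcal{P}'}_\beta(D_{\mathcal{O}'}(\mathfrak{M})) \subseteq \Sigma^{\mathcal{P}}_{\alpha+\beta}(D_{\mathcal{O}}(\mathfrak{M}))$
for all nonnull ordinals $\beta$;

2. $\Pi^{\mathcal{P}'}_\beta(D_{\mathcal{O}'}(\mathfrak{M})) \subseteq \Pi^{\mathcal{P}}_{\alpha+\beta}(D_{\mathcal{O}}(\mathfrak{M}))$
for all ordinals $\beta$.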

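Proof. Without loss of generality we can assume that all members of
$\mathcal{O}'$ are satisfiable in $W$. If $\alpha = 0$ then
$\mathcal{O}' \subseteq \mathcal{O}$, and we conclude immediately with
Proposition 11, so suppose $\alpha \neq 0$. Proof is by induction on $\beta$.
Since all members of $\mathcal{O}'$ are $\Pi_\alpha$ in $\mathcal{P}$,
$D_{\mathcal{O}'}(\mathfrak{M}) \subseteq \Pi^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{M}))$
for all $\mathfrak{M} \in W$, which proves the second inclusion in the statement
of the proposition for $\beta = 0$.

Let us verify that $\mathcal{P}$ is ideal. Let $\mathfrak{M} \in W$ be given. By
the preceding relation
$D_{\mathcal{O}}(\mathfrak{M}) \models^{\mathcal{O}}_W D_{\mathcal{O}'}(\mathfrak{M})$,
hence for all $\varphi \in \mathcal{L}$, if
$D_{\mathcal{O}'}(\mathfrak{M}) \models_W \varphi$ then
$D_{\mathcal{O}}(\mathfrak{M}) \models^{\mathcal{O}}_W \varphi$. Since
$\mathcal{P}'$ is ideal,
$D_{\mathcal{O}'}(\mathfrak{M}) \models^{\mathcal{O}'}_W \varphi$ or
$D_{\mathcal{O}'}(\mathfrak{M}) \models^{\mathcal{O}'}_W \neg\varphi$ for all
sentences $\varphi$. Since $\mathcal{O}'$ is closed under $\sim$,
$\{\varphi \in \mathcal{L} \mid D_{\mathcal{O}'}(\mathfrak{M}) \models_W \varphi\} = \{\varphi \in \mathcal{L} \mid D_{\mathcal{O}'}(\mathfrak{M}) \models^{\mathcal{O}'}_W \varphi\}$
by Lemma 2. We infer immediately that
$D_{\mathcal{O}}(\mathfrak{M}) \models^{\mathcal{O}}_W \varphi$ or
$D_{\mathcal{O}}(\mathfrak{M}) \models^{\mathcal{O}}_W \neg\varphi$ for all
sentences $\varphi$. Hence $\mathcal{P}$ is ideal.

Let $\beta \neq 0$ be given. Suppose that for all $\gamma < \beta$ and
$\mathfrak{M} \in W$,
$\Pi^{\mathcal{P}'}_\gamma(D_{\mathcal{O}'}(\mathfrak{M}))$ is a subset of
$\Pi^{\mathcal{P}}_{\alpha+\gamma}(D_{\mathcal{O}}(\mathfrak{M}))$. Let
$\mathfrak{M} \in W$ and
$\varphi \in \Sigma^{\mathcal{P}'}_\beta(D_{\mathcal{O}'}(\mathfrak{M}))$ be
given. Assume that $\beta = 1$. By Lemma 7, there exists finite
$E \subseteq D_{\mathcal{O}'}(\mathfrak{M})$ such that $E \models_W \varphi$.
Since $\mathcal{P}$ is ideal,
$D_{\mathcal{O}}(\mathfrak{M}) \models^{\mathcal{O}}_W E$. As all members of $E$
are $\Pi_\alpha$ in $\mathcal{P}$, we infer that
$E \subseteq \Pi^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{M}))$, hence
$\bigwedge E \in \Pi^{\mathcal{P}}_\alpha(D_{\mathcal{O}}(\mathfrak{M}))$ by
Lemma 10. This, the fact that $E \models_W \varphi$, and Lemma 9 then imply that
$\varphi \in \Sigma^{\mathcal{P}}_{\alpha+1}(D_{\mathcal{O}}(\mathfrak{M}))$. If
$\beta > 1$, it follows easily from Lemma 9 and the induction hypothesis that
$\varphi \in \Sigma^{\mathcal{P}}_{\alpha+\beta}(D_{\mathcal{O}}(\mathfrak{M}))$.
So we have shown that for all $\mathfrak{M} \in W$,
$\Sigma^{\mathcal{P}'}_\beta(D_{\mathcal{O}'}(\mathfrak{M})) \subseteq \Sigma^{\mathcal{P}}_{\alpha+\beta}(D_{\mathcal{O}}(\mathfrak{M}))$.
Let $\mathfrak{M} \in W$ and
$\varphi \in \Pi^{\mathcal{P}'}_\beta(D_{\mathcal{O}'}(\mathfrak{M}))$ be given.
To complete the proof we show that $\varphi$ belongs to
$\Pi^{\mathcal{P}}_{\alpha+\beta}(D_{\mathcal{O}}(\mathfrak{M}))$. By Lemma 8,
choose $\psi \in \Sigma^{\mathcal{P}'}_\beta(D_{\mathcal{O}'}(\mathfrak{M}))$
such that for all $T \in \mathcal{T}'$ with $T \models^{\mathcal{O}'}_W \psi$
and $T \not\models^{\mathcal{O}'}_W \varphi$,
$\neg\varphi \in \Sigma^{\mathcal{P}'}_\beta(T)$. Let $\mathfrak{N} \in W$ with
$D_{\mathcal{O}}(\mathfrak{N}) \models^{\mathcal{O}}_W \psi$ and
$D_{\mathcal{O}}(\mathfrak{N}) \not\models^{\mathcal{O}}_W \varphi$ be given.
Since $\mathcal{P}$ and $\mathcal{P}'$ are ideal, we infer that
$D_{\mathcal{O}'}(\mathfrak{N}) \models^{\mathcal{O}'}_W \psi$ and
$D_{\mathcal{O}'}(\mathfrak{N}) \not\models^{\mathcal{O}'}_W \varphi$. Hence
$\neg\varphi \in \Sigma^{\mathcal{P}'}_\beta(D_{\mathcal{O}'}(\mathfrak{N}))$, so
$\neg\varphi \in \Sigma^{\mathcal{P}}_{\alpha+\beta}(D_{\mathcal{O}}(\mathfrak{N}))$
as proved above. Since the same part of the proof also shows that
$\psi \in \Sigma^{\mathcal{P}}_{\alpha+\beta}(D_{\mathcal{O}}(\mathfrak{M}))$,
we conclude with Lemma 8 that
$\varphi \in \Pi^{\mathcal{P}}_{\alpha+\beta}(D_{\mathcal{O}}(\mathfrak{M}))$.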

5 Connections with Other Hierarchies 

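In order to be able to establish relations between the hierarchies of
generalized logical consequences and other hierarchies, we define still a new
hierarchy, where the $\Sigma_\alpha$ levels are better behaved:

Definition 14. For all ordinals $\alpha$, the sets of sentences
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha$ and
$\widetilde{\Pi}^{\mathcal{P}}_\alpha$ are defined by induction on $\alpha$, as
follows.

2. For all ordinals $\alpha$,
$\widetilde{\Pi}^{\mathcal{P}}_\alpha = \{\sim\varphi \mid \varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha\}$.

3. Let $\alpha \neq 0$ be given. A sentence $\varphi$ belongs to
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha$ iff there exists $\beta < \alpha$ with
the following property. For all $\mathfrak{M} \in W$ with
$\mathfrak{M} \models \varphi$, there exists finite
$D \subseteq \widetilde{\Pi}^{\mathcal{P}}_\beta$ such that
$\mathfrak{M} \models D$ and $D \models_W \varphi$.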

Informally, we will refer to the hierarchy defined above as the uniform hierarchy. 
We will need the following properties. First note that the levels of the uniform 
hierarchy are ordered as expected in a hierarchy of a Borel type. 

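Proposition 15. For all ordinals $\alpha, \beta$, if $\alpha < \beta$ then
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha \cup \widetilde{\Pi}^{\mathcal{P}}_\alpha \subseteq \widetilde{\Sigma}^{\mathcal{P}}_\beta \cap \widetilde{\Pi}^{\mathcal{P}}_\beta$.

Proof. Trivially,
$\widetilde{\Sigma}^{\mathcal{P}}_0 = \widetilde{\Pi}^{\mathcal{P}}_0$, which
implies immediately that for all nonnull ordinals $\beta$,
$\widetilde{\Sigma}^{\mathcal{P}}_0 \cup \widetilde{\Pi}^{\mathcal{P}}_0 \subseteq \widetilde{\Sigma}^{\mathcal{P}}_\beta \cap \widetilde{\Pi}^{\mathcal{P}}_\beta$.
Let $\alpha \neq 0$ be given. For all $\beta > \alpha$, the inclusions
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha \subseteq \widetilde{\Sigma}^{\mathcal{P}}_\beta$
and
$\widetilde{\Pi}^{\mathcal{P}}_\alpha \subseteq \widetilde{\Pi}^{\mathcal{P}}_\beta$
are straightforward. It is easily verified that
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha \subseteq \widetilde{\Pi}^{\mathcal{P}}_{\alpha+1}$.
We infer that any $\varphi \in \widetilde{\Pi}^{\mathcal{P}}_\alpha$ is equal to
$\sim\psi$ for some $\psi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$, hence
$\sim\varphi \in \widetilde{\Pi}^{\mathcal{P}}_{\alpha+1}$, hence
$\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_{\alpha+1}$. So
$\widetilde{\Pi}^{\mathcal{P}}_\alpha \subseteq \widetilde{\Sigma}^{\mathcal{P}}_{\alpha+1}$.
The result follows.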

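Remember that $\mathcal{L}$ is just a fragment of $L^{S}_{\omega_1\omega}$,
hence need not contain the disjunction or conjunction of each of its countable
subsets. So $\bigvee X$ and $\bigwedge X$ in the closure property below are
members of $L^{S}_{\omega_1\omega}$, but not necessarily members of
$\mathcal{L}$.

Lemma 16. Let $\alpha \neq 0$, sentence $\varphi$, and countable
$X \subseteq \mathcal{L}$ be given.

If $X \subseteq \widetilde{\Sigma}^{\mathcal{P}}_\alpha$ and
$\models_W (\varphi \leftrightarrow \bigvee X)$ then
$\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$.

If $X \subseteq \widetilde{\Pi}^{\mathcal{P}}_\alpha$ and
$\models_W (\varphi \leftrightarrow \bigwedge X)$ then
$\varphi \in \widetilde{\Pi}^{\mathcal{P}}_\alpha$.

Adding the requirement that $W$ consists exclusively of Henkin structures makes
it possible to treat existential quantifiers as countable disjunctions, and
universal quantifiers as countable conjunctions.

Corollary 17. Suppose that $W$ is a set of Henkin structures. Let formula
$\varphi$ with free variables $x_1, \ldots, x_n$ be given. Denote by $X$ the set
of all sentences of the form $\varphi[t_1/x_1, \ldots, t_n/x_n]$ for some closed
terms $t_1, \ldots, t_n$. Let $\alpha \neq 0$ be given.

1. If $X \subseteq \widetilde{\Sigma}^{\mathcal{P}}_\alpha$ then
$\exists x_1 \ldots \exists x_n \varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$.

2. If $X \subseteq \widetilde{\Pi}^{\mathcal{P}}_\alpha$ then
$\forall x_1 \ldots \forall x_n \varphi \in \widetilde{\Pi}^{\mathcal{P}}_\alpha$.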

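We characterize $\widetilde{\Sigma}^{\mathcal{P}}_1$ and
$\widetilde{\Pi}^{\mathcal{P}}_1$, assuming that $\mathcal{O}$ is closed under
the $\sim$ operator.

Proposition 18. Suppose that $\mathcal{O}$ is closed under $\sim$. Let sentence
$\varphi$ be given.

1. $\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_1$ if and only if
$\models_W \varphi$, or $\models_W \neg\varphi$, or there is a nonempty set $X$
of finite, nonempty subsets of $\mathcal{O}$ such that
$\models_W \bigvee\{\bigwedge D \mid D \in X\} \leftrightarrow \varphi$.

2. $\varphi \in \widetilde{\Pi}^{\mathcal{P}}_1$ if and only if
$\models_W \varphi$, or $\models_W \neg\varphi$, or there is a nonempty set $X$
of finite, nonempty subsets of $\mathcal{O}$ such that
$\models_W \bigwedge\{\bigvee D \mid D \in X\} \leftrightarrow \varphi$.

Proof. The proof is trivial if $\models_W \varphi$ or $\models_W \neg\varphi$,
so suppose otherwise. Assume that
$\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_1$. Let $\mathfrak{M} \in W$ be
such that $\mathfrak{M} \models \varphi$. Choose a finite nonempty subset
$D_{\mathfrak{M}}$ of $D_{\mathcal{O}}(\mathfrak{M})$ such that
$D_{\mathfrak{M}} \models_W \varphi$. Set
$X = \{D_{\mathfrak{M}} \mid \mathfrak{M} \in W \text{ and } \mathfrak{M} \models \varphi\}$.
Then $X$ is nonempty, and it is easy to verify that
$\models_W \bigvee\{\bigwedge D \mid D \in X\} \leftrightarrow \varphi$.
Conversely, let nonempty set $X$ of finite, nonempty subsets of $\mathcal{O}$ be
such that
$\models_W \bigvee\{\bigwedge D \mid D \in X\} \leftrightarrow \varphi$. Since
$\bigwedge D \in \widetilde{\Sigma}^{\mathcal{P}}_1$ for all $D \in X$, it
follows from Lemma 16 that $\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_1$. We
conclude that 1. holds, and 2. is an immediate consequence.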

Then we characterize the other levels: 

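Proposition 19. Let $\alpha > 1$ and sentence $\varphi$ be given.

1. $\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$ iff there is nonempty
$X \subseteq \bigcup_{\beta<\alpha} \widetilde{\Pi}^{\mathcal{P}}_\beta$ such
that $\models_W \bigvee X \leftrightarrow \varphi$.

2. $\varphi \in \widetilde{\Pi}^{\mathcal{P}}_\alpha$ iff there is nonempty
$X \subseteq \bigcup_{\beta<\alpha} \widetilde{\Sigma}^{\mathcal{P}}_\beta$ such
that $\models_W \bigwedge X \leftrightarrow \varphi$.

Proof. We assume that $\not\models_W \neg\varphi$ (otherwise, the proof is
trivial). Suppose that $\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$.
Let $\mathfrak{M} \in W$ be such that $\mathfrak{M} \models \varphi$. By
Proposition 15, and since $\widetilde{\Pi}^{\mathcal{P}}_\beta$ is trivially
closed under conjunction, we can choose $\beta < \alpha$ and
$\psi_{\mathfrak{M}} \in \widetilde{\Pi}^{\mathcal{P}}_\beta$ such that
$\mathfrak{M} \models \psi_{\mathfrak{M}}$ and
$\psi_{\mathfrak{M}} \models_W \varphi$. Set
$X = \{\psi_{\mathfrak{M}} \mid \mathfrak{M} \in W \text{ and } \mathfrak{M} \models \varphi\}$.
Then $X \neq \emptyset$ and it is easy to verify that
$\models_W \bigvee X \leftrightarrow \varphi$. Conversely, let nonempty
$X \subseteq \bigcup_{\beta<\alpha} \widetilde{\Pi}^{\mathcal{P}}_\beta$ be such
that $\models_W \bigvee X \leftrightarrow \varphi$. Then
$X \subseteq \widetilde{\Sigma}^{\mathcal{P}}_\alpha$ by Proposition 15 again,
and Lemma 16 implies that $\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$.
Hence 1. holds, and 2. is an immediate consequence.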

The uniform hierarchy is more easily investigated than the hierarchies of gener- 
alized logical consequences. But our aim is to gain more insights into the latter, 
by transferring properties of the former. For this to be possible, we need to estab- 
lish relations between the uniform hierarchy and the hierarchies of generalized 
logical consequences. In one direction we have: 

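Proposition 20. Suppose that $\mathcal{O}$ is closed under $\sim$ and
$\mathcal{P}$ is ideal. Then for all ordinals $\alpha$ and $T \in \mathcal{T}$,
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha(T) \subseteq \Sigma^{\mathcal{P}}_\alpha(T)$
and
$\widetilde{\Pi}^{\mathcal{P}}_\alpha(T) \subseteq \Pi^{\mathcal{P}}_\alpha(T)$.

Proof. Proof is by induction. Since $\mathcal{O}$ is closed under $\sim$ and
$\mathcal{P}$ is standard,
$\widetilde{\Sigma}^{\mathcal{P}}_0(T) = \Sigma^{\mathcal{P}}_0(T) = \Pi^{\mathcal{P}}_0(T) = \widetilde{\Pi}^{\mathcal{P}}_0(T) = T$,
for all $T \in \mathcal{T}$. Let $\alpha \neq 0$ be given. Let
$T \in \mathcal{T}$ be given, and suppose that
$\widetilde{\Pi}^{\mathcal{P}}_\beta(T)$ is included in
$\Pi^{\mathcal{P}}_\beta(T)$ for all $\beta < \alpha$. Let
$\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha(T)$ be given. Choose
$\beta < \alpha$ and finite $D \subseteq \widetilde{\Pi}^{\mathcal{P}}_\beta$
such that $T \models^{\mathcal{O}}_W D$ and $D \models_W \varphi$. By induction
hypothesis, $D \subseteq \Pi^{\mathcal{P}}_\beta(T)$, which implies immediately
that $\varphi \in \Sigma^{\mathcal{P}}_\alpha(T)$. Now suppose that for all
$T \in \mathcal{T}$,
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha(T) \subseteq \Sigma^{\mathcal{P}}_\alpha(T)$.
Let $T \in \mathcal{T}$ and $\varphi \in \widetilde{\Pi}^{\mathcal{P}}_\alpha(T)$
be given. If for all $T' \in \mathcal{T}$,
$T' \models^{\mathcal{O}}_W \varphi$, then trivially
$\varphi \in \Pi^{\mathcal{P}}_\alpha(T)$. Suppose there exists
$T' \in \mathcal{T}$ such that $T' \not\models^{\mathcal{O}}_W \varphi$. Since
$\mathcal{P}$ is ideal, $T' \models^{\mathcal{O}}_W \neg\varphi$. Hence
$\neg\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha(T')$. So by inductive
hypothesis, $\neg\varphi$ belongs to $\Sigma^{\mathcal{P}}_\alpha(T')$, and we
conclude that $\varphi \in \Pi^{\mathcal{P}}_\alpha(T)$.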

The other direction of the relationship between the uniform hierarchy and the
hierarchies of generalized logical consequences is more subtle. It is given by
the following proposition, left without proof for lack of space.

Proposition 21. Suppose that $\mathcal{P}$ is ideal. Let $\alpha > 1$,
$T \in \mathcal{T}$, and $\varphi \in \Sigma^{\mathcal{P}}_\alpha(T)$ be given.
There exists $\beta < \alpha$ and
$\psi \in \widetilde{\Pi}^{\mathcal{P}}_\beta(T)$ such that
$\psi \models_W \varphi$.

The natural connection between the hierarchies of generalized logical conse- 
quences (for paradigms of a special form) and formula complexity is given by 
the result below, together with Proposition 20. Recall that a basic formula is 
an atomic formula or the negation of an atomic formula. Also note that the 
assumptions of the proposition below imply that $\mathcal{P}$ is ideal.

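Proposition 22. Suppose that $W$ is a set of Henkin structures, $\mathcal{O}$ is
the set of all basic sentences, and $\mathcal{P}$ is standard. Let
$\alpha \neq 0$ be given. Every sentence in $\Sigma_\alpha$ (respect.
$\Pi_\alpha$) prenex form belongs to $\widetilde{\Sigma}^{\mathcal{P}}_\alpha$
(respect. $\widetilde{\Pi}^{\mathcal{P}}_\alpha$).

Proof. Proof is by double induction on ordinals. Suppose that for all nonnull
$\gamma < \alpha$, every sentence in $\Sigma_\gamma$ prenex form belongs to
$\widetilde{\Sigma}^{\mathcal{P}}_\gamma$ (hence every sentence in $\Pi_\gamma$
prenex form belongs to $\widetilde{\Pi}^{\mathcal{P}}_\gamma$). Let sentence
$\varphi$ and ordinal $\beta$ be such that $\beta$ is the height of $\varphi$
and $\varphi$ is in $\Sigma_\alpha$ prenex form. Suppose that for all
$\gamma < \beta$, every sentence of height $\gamma$ which is in $\Sigma_\alpha$
prenex form belongs to $\widetilde{\Sigma}^{\mathcal{P}}_\alpha$. We now
distinguish the cases corresponding to the definition of a formula being in
$\Sigma_\alpha$ prenex form. Assume that $\varphi$ is in $\Sigma_\gamma$ or
$\Pi_\gamma$ prenex form for some $\gamma < \alpha$. If $\gamma$ is nonnull then
$\varphi$ belongs to $\widetilde{\Sigma}^{\mathcal{P}}_\gamma$ or
$\widetilde{\Pi}^{\mathcal{P}}_\gamma$ by inductive hypothesis, hence to
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha$. Suppose that $\gamma = 0$. Then
$\varphi$ is a quantifier free member of $L^{S}_{\omega\omega}$. Let
$\mathfrak{M} \in W$ be such that $\mathfrak{M} \models \varphi$. Let $D$ be the
set of all basic sentences $\psi$ such that either $\psi$ or $\neg\psi$ occurs
in $\varphi$, and $\mathfrak{M} \models \psi$. By the choice of $\mathcal{O}$
and the fact that $\mathcal{P}$ is standard, it is clear that
$D \subseteq D_{\mathcal{O}}(\mathfrak{M})$ and $D \models \varphi$. Hence
$\varphi$ belongs to $\widetilde{\Sigma}^{\mathcal{P}}_1$, hence to
$\widetilde{\Sigma}^{\mathcal{P}}_\alpha$. Assume that $\varphi$ is of form
$\exists x \psi$ for some variable $x$ and $\psi \in \mathcal{L}$ which is in
$\Sigma_\alpha$ prenex form. Set $X = \{\psi[t/x] \mid t \text{ closed term}\}$.
Then $X$ consists of sentences whose heights are smaller than $\beta$ and which
are in $\Sigma_\alpha$ prenex form, and it follows from Corollary 17 that
$\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$. Suppose that $\varphi$ is
of form $\bigvee X$ for some (countable) set $X$ of formulas which are in
$\Sigma_\alpha$ prenex form. Then $X$ consists of sentences whose height is
smaller than $\beta$ and which are in $\Sigma_\alpha$ prenex form, and it
follows from Lemma 16 that $\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$.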

Using some connection with the Borel hierarchy (see [18] for definitions and 
properties), we can exhibit natural paradigms such that the associated uniform 
hierarchy does not collapse to a finite level: 

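Proposition 23. There exists a finite, equality free vocabulary $V$ and a subset
$T$ of $L^{V}_{\omega\omega}$ with the following property. Suppose that $S = V$,
$W$ is the set of Herbrand models of $T$, $\mathcal{O}$ is the set of basic
sentences, and $\mathcal{P}$ is standard. Then for all nonnull
$n \in \mathbb{N}$,
$\widetilde{\Pi}^{\mathcal{P}}_n \setminus \widetilde{\Sigma}^{\mathcal{P}}_n$
contains a $\Pi_n$ formula in $L^{V}_{\omega\omega}$.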

Proof. Given a G {0, 1}“^^, the basic open set of the Cantor space consisting 
of members c G {0, 1}^ that extend a is denoted Oa. For all nonnull n G N, 
choose total mapping : N” ^ {0, 1}<^ such that the set A„ equal to 
HneNUiaeN ■ ■ ■ is 77„ Borel, but not A„ Borel, in the Cantor 
space. Suppose that $S = \{0, s, \langle\rangle, P\}$ where $0$ is a constant,
$s$ a unary function symbol, $\langle\rangle$ a binary function symbol, and $P$
a unary predicate symbol. Given nonnull $n \in \mathbb{N}$, we denote by
$\overline{n}$ the term obtained from $0$ by $n$ applications of $s$. Given
$n > 2$ and terms $t_1, \ldots, t_n$, we denote by
$\langle t_1, \ldots, t_n \rangle$ the term
$\langle t_1, \langle t_2, \ldots \langle t_{n-1}, t_n \rangle \ldots \rangle\rangle$.
Choose a bijective mapping $f$ from the set of closed terms into $\mathbb{N}$.
Choose a surjective mapping $g$ from the set of closed terms into
$\{0,1\}^{<\omega}$ such that for all nonnull $n \in \mathbb{N}$ and closed terms
$t_1, \ldots, t_n$,
$g(\langle \overline{n}, t_1, \ldots, t_n \rangle) = h_n(f(t_1), \ldots, f(t_n))$.
Given $c \in \{0,1\}^{\omega}$, we define $\mathfrak{M}_c$ to be the unique
Herbrand structure which satisfies:

(*) for every closed term $t$, $\mathfrak{M}_c \models Pt$ iff
$c \in O_{g(t)}$.

Suppose that $W = \{\mathfrak{M}_c \mid c \in \{0,1\}^{\omega}\}$, $\mathcal{O}$
is the set of closed basic formulas, and $\mathcal{P}$ is standard. Since $g$ is
surjective, $W$ clearly consists of the Herbrand models of the set of:

1. all formulas of the form $Pt$, where $t$ is a closed term with $g(t) = ()$;

2. all formulas of the form $Pt \rightarrow \neg Pt'$, where $t, t'$ are closed
terms such that $g(t) \not\subseteq g(t')$ and $g(t') \not\subseteq g(t)$;

3. all formulas of the form $Pt \rightarrow Pt_0 \vee Pt_1$, where
$t, t_0, t_1$ are closed terms such that $g(t_0) = g(t) * 0$ and
$g(t_1) = g(t) * 1$.

For all nonnull n G N, set ipn = Vxi3a;2 . • . H((n, xi, X 2 , • ■ • , a:„)). Since / is 
bijective, it follows easily from the definition of g that for all members c of 
{0, 1}<” and nonnull n G N, Tic \= belongs to A„. Let ^ (0, 1}” 

be such that: 

— for all closed terms t, <P{P{t)) = Og(ty, 

— for all V ')/’) = U ^(V’) and ^((/? Aip) = H 

— for all nonempty sets X of members of <P{\/ X) = ^(‘P) and 

Using the definition of <P and (*) above, it is easy to show by induction that for 
all nonnull n G N, the following holds. Let (p he a member of (respect. 

There exists a closed member p* of built from 0 using V, A, \J , /\ only, and 
such that: 

— ‘P{p*) is En (respect. Pin) Borel in the Gantor space; 

— for all c G (0, 1}”, Tic \= p c & 

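Let nonnull $n \in \mathbb{N}$ be given. By Proposition 22, $\varphi_n$ belongs
to $\widetilde{\Pi}^{\mathcal{P}}_n$. Suppose for a contradiction that
$\varphi_n \in \widetilde{\Sigma}^{\mathcal{P}}_n$. Then $\Phi((\varphi_n)^*)$
is $\Sigma_n$ Borel in the Cantor space. Moreover, we have seen that for every
$c \in \{0,1\}^{\omega}$, $\mathfrak{M}_c \models \varphi_n$ iff
$c \in \Phi((\varphi_n)^*)$, and for every $c \in \{0,1\}^{\omega}$,
$\mathfrak{M}_c \models \varphi_n$ iff $c \in A_n$. Hence
$A_n = \Phi((\varphi_n)^*)$, which contradicts the fact that $A_n$ is not
$\Sigma_n$ Borel in the Cantor space.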

The result in Proposition 23 can be transferred to the hierarchies of generalized 
logical consequences, thanks to the following corollary to Proposition 21. 
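
Proposition 24. Suppose that $\mathcal{O}$ is closed under $\sim$ and
$\mathcal{P}$ is ideal. Let $\alpha \neq 0$ and $\varphi \in \mathcal{L}$ be
such that
$\varphi \in \widetilde{\Pi}^{\mathcal{P}}_\alpha \setminus \widetilde{\Sigma}^{\mathcal{P}}_\alpha$.
Then
$\varphi \in \Pi^{\mathcal{P}}_\alpha(T) \setminus \Sigma^{\mathcal{P}}_\alpha(T)$
for some $T \in \mathcal{T}$.

Proof. Since $\varphi \notin \widetilde{\Sigma}^{\mathcal{P}}_\alpha$, neither
$\models_W \varphi$ nor $\models_W \neg\varphi$. Suppose for a contradiction
that for all $T \in \mathcal{T}$, if $\varphi \in \Pi^{\mathcal{P}}_\alpha(T)$
then $\varphi \in \Sigma^{\mathcal{P}}_\alpha(T)$. So for all
$T \in \mathcal{T}$, if $T \models^{\mathcal{O}}_W \varphi$ then
$\varphi \in \Sigma^{\mathcal{P}}_\alpha(T)$. We distinguish two cases.

1st Case: $\alpha = 1$. By Lemma 7, given $T \in \mathcal{T}$ with
$\varphi \in \Sigma^{\mathcal{P}}_1(T)$, we can choose finite, nonempty
$D_T \subseteq T$ such that $D_T \models_W \varphi$. Let $X$ be the nonempty set
of sets of form $D_T$, for all $T \in \mathcal{T}$ with
$T \models^{\mathcal{O}}_W \varphi$. Clearly,
$\models_W \varphi \leftrightarrow \bigvee\{\bigwedge D_T \mid D_T \in X\}$. It
then follows from Proposition 18 that
$\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_1$, contradiction.

2nd Case: $\alpha > 1$. By Proposition 21, given $T \in \mathcal{T}$ such that
$\varphi \in \Sigma^{\mathcal{P}}_\alpha(T)$, we can choose $\beta < \alpha$ and
$\psi_T \in \widetilde{\Pi}^{\mathcal{P}}_\beta(T)$ such that
$\psi_T \models_W \varphi$. Let $X$ be the nonempty set of formulas of form
$\psi_T$, for all $T \in \mathcal{T}$ such that
$T \models^{\mathcal{O}}_W \varphi$. Clearly,
$\models_W \varphi \leftrightarrow \bigvee X$. It then follows from Proposition
19 that $\varphi \in \widetilde{\Sigma}^{\mathcal{P}}_\alpha$, contradiction.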

6 Connections to Learning Theory 

Many classical learning paradigms can be cast in this framework; in other words,
some learning paradigms are isomorphic to some (standard) paradigms (quintuples
of the form $(S, \mathcal{L}, W, \mathcal{O}, \mathcal{T})$), and many
learnability results are equivalent to statements involving concepts defined
from the notion of generalized logical consequence. We give a flavour of the
connection, in the form of a few definitions and results left without proof.

The definitions usually given in the numerical setting (see [7]) are immediately
adapted to the logical one. Given $T \subseteq \mathcal{L}$, a text for $T$ is a
sequence of members of $T$ with at least one occurrence of every member of $T$.
If $e$ is a text for $T$ and $k$ an integer, we denote by $e[k]$ the initial
segment of $e$ of length $k$. A learner is a partial function from the set of
finite sequences of members of $\mathcal{L}$ into the power set of
$\mathcal{L}$. (Only uncomputable learners will be considered here, but the
following material is easily relativized to computable learners.)

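Definition 25 ([4]). Let learner $f$ and $\varphi \in \mathcal{L}$ be given. We
say that $f$ partially identifies $\varphi$ in $\mathcal{P}$ iff for all
$T \in \mathcal{T}$ and texts $e$ for $T$, the following are equivalent:

1. $T \models^{\mathcal{O}}_W \varphi$.

2. $\{k \in \mathbb{N} \mid \varphi \notin f(e[k])\}$ is finite.

If $f$ partially identifies $\varphi$ in $\mathcal{P}$ and there is no
$T \in \mathcal{T}$, no text $e$ for $T$, and no $k_1, k_2 \in \mathbb{N}$ with
$k_1 < k_2$, $\varphi \in f(e[k_1])$, and $\varphi \notin f(e[k_2])$, then we
say that $f$ partially identifies $\varphi$ in $\mathcal{P}$ without mind
changes.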
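
To make the vocabulary of texts, initial segments, and mind changes concrete, here is a minimal Python sketch on a finite propositional toy paradigm. Everything in it is an assumption made for illustration and is not taken from the text: the class of possible worlds (sets of true atoms satisfying P0 → P1), the restriction of sentences to literals, and the names `learner`, `outputs_along`, and `has_mind_change`. The learner is purely deductive: it outputs the literals true in every possible world compatible with the positive data seen so far. A finite prefix of a text can only illustrate the condition of Definition 25, not establish it for all texts.

```python
from itertools import combinations

ATOMS = ("P0", "P1", "P2")
LITERALS = [(a, pol) for a in ATOMS for pol in (True, False)]

# Hypothetical class of possible worlds: sets of true atoms satisfying P0 -> P1.
WORLDS = [frozenset(c) for r in range(len(ATOMS) + 1)
          for c in combinations(ATOMS, r)
          if "P0" not in c or "P1" in c]


def holds(world, literal):
    atom, positive = literal
    return (atom in world) == positive


def learner(sequence):
    """Output the literals true in every possible world satisfying the data."""
    seen = set(sequence)
    compatible = [w for w in WORLDS if seen <= w]
    return {l for l in LITERALS if all(holds(w, l) for w in compatible)}


def outputs_along(text_prefix, literal):
    """For k = 0..len(prefix): does the learner output `literal` on e[k]?"""
    return [literal in learner(text_prefix[:k])
            for k in range(len(text_prefix) + 1)]


def has_mind_change(text_prefix, literal):
    """True if the literal is output at some stage and missing at a later one."""
    flags = outputs_along(text_prefix, literal)
    return any(flags[i] and not flags[j]
               for i in range(len(flags)) for j in range(i + 1, len(flags)))


# Prefix of a text for the theory {P0, P1} (positive data, repetitions allowed).
prefix = ["P1", "P0", "P1"]
print(outputs_along(prefix, ("P1", True)))    # [False, True, True, True]
print(outputs_along(prefix, ("P2", False)))   # [False, False, False, False]
print(has_mind_change(prefix, ("P1", True)))  # False
```

Because the set of compatible worlds can only shrink as data arrive, this learner never retracts a literal, which is the behaviour required for learning without mind changes. The negation of P2 is never output on this prefix, although it holds in the minimal model of the theory: reaching such sentences requires going beyond purely deductive conclusions.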

Being at level $\Sigma_1$ of the hierarchies of generalized logical consequences
(which in our view is also equivalent to being a deductive consequence of the
underlying theory) and being learnable without mind changes are basically the
same notion. No assumption on the paradigm $\mathcal{P}$ is needed:

Proposition 26. Let sentence $\varphi$ be given. The following are equivalent.

1. $\varphi$ is $\Sigma_1$ in $\mathcal{P}$.

2. Some learner partially classifies $\varphi$ in $\mathcal{P}$ without mind
changes.

Still without assumption on $\mathcal{P}$, it can be verified that level
$\Sigma_2$ of our hierarchies consists of nothing but formulas that are
partially identifiable:

Proposition 27. There exists a learner $f$ such that for all sentences
$\varphi$, if $\varphi$ is $\Sigma_2$ in $\mathcal{P}$ then $f$ partially
classifies $\varphi$ in $\mathcal{P}$.

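The converse of Proposition 27 holds for standard paradigms (classical learning
paradigms indeed correspond to standard paradigms) when the set of
$\mathcal{O}$-diagrams of members of $W$ is countable, as a consequence of the
assumption of the following proposition together with the fact that
$\mathcal{L}$ is countable.

Proposition 28. Set
$\sim\mathcal{O} = \{\sim\varphi \mid \varphi \in \mathcal{O}\}$. Suppose that
$\mathcal{P}$ is standard and $\bigwedge D_{\sim\mathcal{O}}(\mathfrak{M})$
belongs to $\mathcal{L}$ for all $\mathfrak{M} \in W$. Let sentence $\varphi$ be
given. The following are equivalent.

1. $\varphi$ is $\Sigma_2$ in $\mathcal{P}$.

2. Some learner partially classifies $\varphi$ in $\mathcal{P}$.

3. For all $T \in \mathcal{T}$ with $T \models^{\mathcal{O}}_W \varphi$, there
exists finite $D \subseteq T$ such that for all $T' \in \mathcal{T}$, if
$D \subseteq T' \subseteq T$ then $T' \models^{\mathcal{O}}_W \varphi$.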

Identification in the limit of classes of nonempty languages also has a natural
analogue here. Basically, it corresponds to the hierarchies of generalized
logical consequences collapsing to level $\Sigma_2$. Note the similarity of
clause 3. with the characterization of learnability of classes of nonempty
recursive languages from positive data given in [1].

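Proposition 29. Set
$\overline{\mathcal{O}} = \mathcal{O} \cup \{\sim\varphi \mid \varphi \in \mathcal{O}\}$.
Suppose that $\mathcal{P}$ is standard and
$\bigwedge D_{\overline{\mathcal{O}}}(\mathfrak{M})$ belongs to $\mathcal{L}$
for all $\mathfrak{M} \in W$. The following are equivalent.

1. Every sentence is $\Sigma_2$ in $\mathcal{P}$.

2. For all $\mathfrak{M} \in W$,
$\bigwedge D_{\overline{\mathcal{O}}}(\mathfrak{M}) \in \Sigma^{\mathcal{P}}_2(D_{\mathcal{O}}(\mathfrak{M}))$.

3. For all $T \in \mathcal{T}$, there exists finite $D \subseteq T$ such that
for all $T' \in \mathcal{T}$, if $D \subseteq T'$ then $T' \not\subset T$.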
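
Clause 3. is a finite tell-tale condition, and for a finite class of finite theories it can be checked by brute force. The Python sketch below is only an illustration under that finiteness assumption, with a hypothetical function name `tell_tale` and theories represented as sets of atom names. For each theory it searches for a smallest finite subset D such that no theory of the class both contains D and is strictly included in the given theory.

```python
from itertools import combinations


def tell_tale(theory, theories):
    """Return a smallest D included in `theory` such that no member of
    `theories` contains D and is a proper subset of `theory`."""
    elems = sorted(theory)
    for size in range(len(elems) + 1):
        for candidate in combinations(elems, size):
            d = frozenset(candidate)
            if not any(d <= other and other < theory for other in theories):
                return d
    # Not reachable for finite theories: D = theory always satisfies the test.
    raise AssertionError("no tell-tale found")


# Hypothetical class of possible theories (positive data only).
THEORIES = [frozenset({"P0"}),
            frozenset({"P0", "P1"}),
            frozenset({"P0", "P1", "P2"})]

for t in THEORIES:
    print(sorted(t), "->", sorted(tell_tale(t, THEORIES)))
```

For finite theories the search always succeeds, since the whole theory is itself a suitable D. The condition only becomes restrictive when theories are infinite, where a finite D of this kind need not exist.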

There are connections between our framework and learning paradigms with un- 
countably many possible worlds (as examples of such paradigms, see [8,13]). We 
can get results similar to the previous ones with suitable sets of assumptions. 

7 Conclusion 

We still have not answered the following question: what is an inductive
consequence of a theory $T$? We think there are good epistemological reasons to claim 
that a sentence $\varphi$ is an inductive consequence of $T$ (in $\mathcal{P}$, assuming that $T$ belongs 
to T) just in case v? is a member of III we would identify sentences ob- 

tained from T by one application of the compactness property with deductive 
consequences of T, and sentences obtained from T by one application of the 
weak compactness property as inductive consequences of T. But the weak com- 
pactness property needs justifications coming from many directions. The fact 
that it enables, together with the compactness property, to build hierarchies of 
generalized logical consequences that have natural connections with other clas- 
sical hierarchies, provides another justification. Higher levels of the hierarchies 
correspond to more challenging ways of discovering the truth. In this respect, 
level $\Sigma_2$ deserves special attention, thanks to the connection between this level 
and the notion of learnability in the limit. Another part of our project, an In- 
ductive Prolog, targets precisely these formulas. Given a possible theory T and 
a sentence ip of the form 3 xiy'ip(x,y), where ^ is a quantifier-free formula, such 
that T ip, the system will try to compute a sequence of terms t such that 
T yytp{t,y), performing top-down searches both at level S2 and at level Si. 

Abstract. The protein conformation problem, one of the hard and important
problems, is to identify conformation rules which transform sequences
into their tertiary structures, called conformations. The aim of this work
is to give a concrete theoretical foundation for a graph-theoretic approach
to the protein conformation problem in the framework of a probabilistic
learning model. We formulate the conformation problem as a problem of
learning from hypergraphs that capture the conformations of proteins in a
loose way. We consider several classes of functions based on conformation
rules, and show that they are PAC-learnable. We also discuss the refutable
PAC-learnability of functions, which would be helpful when a target
function is not in the class of functions under consideration. Finally, we
report the conformation rules learned in our preliminary computational
experiments. 



1 Introduction 

A protein is a chain of amino acid residues that folds into a unique native ter- 
tiary structure under specific conditions. Biochemical experiments show that 
an unfolded protein spontaneously refolds into its native structure when spe- 
cific conditions are restored. This is the basis for the hypothesis that the native 
structure of a protein can be determined from the information contained in the 
amino acid sequence. Under this hypothesis, various computational methods of 
predicting protein conformation from sequence have been proposed. 

Protein conformation is analyzed in terms of free energy, where it is assumed 
that the free energy of the native structure of a protein is the global minimum, 
which is known as the “thermodynamic hypothesis.” Many computational methods 
based on this assumption have been extensively developed. For example, Church 
and Shalloway [1] developed a top-down search procedure in which conformation 
space is recursively dissected according to the intrinsic hierarchical structure of 
a landscape’s effective-energy barriers, and Konig and Dandekar [4] applied
genetic algorithms to this problem. Another interesting heuristic method is the 
hydrophobic zipper method by Dill et al. [2]. Based on the fact that many
hydrophobic contacts are topologically local, the hydrophobic zipper method
randomly generates hydrophobic contacts between residues that are close enough
in the sequence, which serve as constraints forcing other hydrophobic contacts
to be zipped up sequentially. 

Inspired by this hydrophobic zipper method, but apart from the free-energy 
minimization problem, we introduce a hypergraph representation of the tertiary 
structure of a protein, and a conformation rule which is defined as a rewriting 
rule of hypergraphs. 

Many simple conformation models in free-energy minimization problems use 
lattices, which are periodic graphs in two- or three-dimensional space. The
conformation of a protein then turns out to be a self-avoiding path in the lattice
in which the nodes are labeled by the amino acids. Thus the hypergraph
representation model is a generalization of the lattice model. The degree of a 
node $v$ of a hypergraph is the number of hyperedges including $v$, and the rank 
of a hyperedge $e$ is the number of the nodes in $e$. Because of the spatial
conditions of conformations, it is natural to require that both the degrees
and the ranks of a hypergraph representing a tertiary structure be bounded
by constants, which is helpful in learning conformation rules. We 
capture the tertiary structure of a protein as a hypergraph in a loose way, from 
which conformation rules are extracted. 

Conformation rules are repeatedly applied to a hypergraph, where the initial 
hypergraph is a hypergraph representing an amino acid sequence, called a
chain-hypergraph. The procedure searches the current hypergraph for a location 
to which a conformation rule is applicable, from local toward global as in the 
hydrophobic zipper method. Thus we can say that our procedure of applying 
conformation rules to a sequence obeys the “local to global” folding principle, 
which is one of the various folding principles proposed so far. The resulting 
hypergraph represents the structure of the protein. 

We then consider the problem of learning conformation rules from hypergraph
representations of proteins. A conformation is defined as a function from 
sequences to hypergraphs. Thus the problem is to learn functions from examples,
each of which is a pair of a protein sequence and the corresponding hypergraph 
representation. The PAC-learning paradigm was extended to include functions 
by Natarajan and Tadepalli [9] and some results on concept learning have been 
extended to functions [7,8]. 

This paper has three contributions. The first is a formulation of conformation
rules by using hypergraphs. The second is a polynomial-time PAC-learning algorithm 
for a class which is defined by this new concept of conformation rules. The third 
is a set of results on the refutable PAC-learnability of functions, which would be
helpful when a target function is not in the classes of functions we consider. 

We have implemented the algorithms for learning conformation rules and for
applying conformation rules in the Python language [13]. Preliminary
computational experiments have been carried out using TIM barrel proteins, whose
data files can be downloaded from the site of the Protein Data Bank (PDB) [14]. The 
results of the experiments are also reported. 

2 Preliminaries 

A hypergraph $H = (V, E)$ consists of a set $V$ of nodes and a set $E$ of hyperedges,
each of which is a nonempty subset of $V$. In this paper we assume that $|e| \ge 2$ 
for all $e \in E$ without further notice. The rank of $H$ is $r(H) = \max_{e \in E} |e|$. For a 
node $v$, the degree of $v$ is $d_H(v) = |\{e \in E \mid v \in e\}|$ and the degree of $H$ is 
$d(H) = \max_{v \in V} d_H(v)$. A chain-hypergraph is a hypergraph $H = (V, E)$ such 
that $V = \{1, 2, \ldots, n\}$ for some $n \ge 1$ and each $\{i, i+1\}$ is contained in some 
hyperedge in $E$ for $1 \le i \le n-1$, i.e., there is $e \in E$ with $\{i, i+1\} \subseteq e$. In particular, 
a chain-hypergraph $H = (V, E)$ is called a rank $k$ linear chain-hypergraph if 
$E = \{\{i, \ldots, i+k-1\} \mid 1 \le i \le n-k+1\}$. For a set $E$ of hyperedges, we call 

$\mathrm{simplify}(E) = E - \{e \in E \mid \text{there is } e' \in E \text{ with } e \subseteq e' \text{ and } e \ne e'\}$ 

the simplification of $E$. In this paper we consider a hypergraph $H = (V, E)$ 
whose nodes are labeled with a mapping $\psi : V \to \Delta$, where $\Delta$ is an alphabet. 
It is denoted by $H = (V, E, \psi)$, and called a hypergraph over $\Delta$. We identify 
$H = (V, E, \psi)$ with $H = (V, E)$ without further notice. Let $H = (V, E, \psi)$ and 
$V' \subseteq V$. For convenience, we denote by $H(V')$ the subhypergraph
$\hat{H} = (\hat{V}, \hat{E}, \hat{\psi})$ of $H$ where 

- $\hat{E} = \bigcup_{v \in V'} \{e \in E \mid v \in e\}$, 

- $\hat{V} = \bigcup_{e \in \hat{E}} e \cup V'$, 

- $\hat{\psi} = \psi|_{\hat{V}}$, that is, the restriction of $\psi$ to $\hat{V}$. 
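
The notions above can be illustrated with a minimal Python sketch. It assumes that a hypergraph is stored simply as a set of frozensets of 1-based node numbers; the function names are chosen here for illustration and are not taken from the implementation described later.

```python
# Hyperedges are frozensets of 1-based node numbers (a representation assumed here).

def rank(edges):
    """r(H): size of the largest hyperedge."""
    return max(len(e) for e in edges)

def degree(edges, v):
    """d_H(v): number of hyperedges that contain node v."""
    return sum(1 for e in edges if v in e)

def linear_chain(n, k):
    """Hyperedges {i, ..., i+k-1} of the rank k linear chain-hypergraph on n nodes."""
    return {frozenset(range(i, i + k)) for i in range(1, n - k + 2)}

def simplify(edges):
    """Drop every hyperedge that is a proper subset of another hyperedge."""
    return {e for e in edges if not any(e < f for f in edges)}

def subhypergraph(edges, nodes):
    """H(V'): the hyperedges touching V' and the nodes they cover (plus V' itself)."""
    kept = {e for e in edges if e & nodes}
    covered = set(nodes).union(*kept)
    return covered, kept
```

For example, `linear_chain(5, 3)` yields the three hyperedges {1,2,3}, {2,3,4} and {3,4,5}, and adding {1,2} to them and calling `simplify` removes it again.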

This subsection reviews some notions and results on the PAC-learnability of 
a class of functions by following Natarajan [7,8]. For an alphabet $\Omega$, the set of 
all strings over $\Omega$ is denoted by $\Omega^*$. The length of a string $x \in \Omega^*$ is denoted by 
$|x|$. For $n \ge 1$, $\Omega^{[n]} = \{x \in \Omega^* \mid |x| \le n\}$. Here, the alphabet $\Omega$ is assumed to 
be finite. 

Definition 1 ([7,8]). Let $F$ be a class of functions from a finite set $X$ to a finite 
set $Y$. The generalized VC-dimension of $F$, denoted by $D(F)$, is the maximum 
over the sizes $|Z|$ of subsets $Z \subseteq X$ such that there exist two functions $f$ and $g$ 
in $F$ satisfying the following conditions: 

1. $f(x) \ne g(x)$ for all $x \in Z$. 

2. For all $Z_1 \subseteq Z$, there exists $h \in F$ that agrees with $f$ on $Z_1$ and with $g$ on 
$Z - Z_1$. 

Lemma 1 ([7,8]). Let $F$ be a class of functions from a finite set $X$ to a finite 
set $Y$. Then 

$2^{D(F)} \le |F| \le |X|^{D(F)} |Y|^{2D(F)}$. 
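
For very small classes the generalized VC-dimension can be checked by exhaustive search. The following sketch is only an illustration under the assumption that each function is given as a Python dict on the finite domain X; it is exponential in |X| and is not meant as a practical procedure.

```python
from itertools import combinations

def subsets(items):
    """All subsets of items as tuples, in order of increasing size."""
    items = list(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def generalized_dimension(F, X):
    """Brute-force D(F) for a finite class F of functions given as dicts on X."""
    best = 0
    for Z in subsets(X):
        if len(Z) <= best:
            continue
        for f in F:
            for g in F:
                # condition 1: f and g differ everywhere on Z
                if any(f[x] == g[x] for x in Z):
                    continue
                # condition 2: every Z1 needs an h equal to f on Z1 and to g on Z - Z1
                if all(
                    any(all(h[x] == (f[x] if x in Z1 else g[x]) for x in Z) for h in F)
                    for Z1 in subsets(Z)
                ):
                    best = len(Z)
    return best
```

With X = {0, 1} and Y = {0, 1}, the class of all four functions has dimension 2 and the class of the two constant functions has dimension 1, which agrees with the bounds of Lemma 1.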

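Let $f : \Omega^* \to \Omega^*$. For integers $n_1, n_2 \ge 1$, the projection $f^{[n_1][n_2]}$ of $f$ on 
$\Omega^{[n_1]} \times \Omega^{[n_2]}$ is the function $f^{[n_1][n_2]} : \Omega^{[n_1]} \to \Omega^{[n_2]}$ defined by
$f^{[n_1][n_2]}(x) = f(x)$ if $f(x)$ is in $\Omega^{[n_2]}$ for all $x$ in $\Omega^{[n_1]}$. If there is some $x$
in $\Omega^{[n_1]}$ such that $f(x)$ is not in $\Omega^{[n_2]}$, then $f^{[n_1][n_2]}$ is undefined. For a class
$F$ of functions from $\Omega^*$ to $\Omega^*$, we define 

$F^{[n_1][n_2]} = \{ f^{[n_1][n_2]} \mid f \in F,\ f^{[n_1][n_2]} \text{ is defined} \}$. 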

Definition 2 ([7,8]). Let $F$ be a class of functions from $\Omega^*$ to $\Omega^*$ with a
representation $R$. An algorithm $A$ is said to be a polynomial-time fitting for $F$ in 
representation $R$ if the following conditions hold: 

1. $A$ is a polynomial-time algorithm taking as input a finite subset $S$ of $\Omega^* \times \Omega^*$. 

2. If there exists a function in $F$ that is consistent with $S$, $A$ outputs a name 
of the function in representation $R$. 

We say that $F$ is of polynomial-dimension if there is a polynomial $p(n_1, n_2)$ in 
$n_1$ and $n_2$ such that $D(F^{[n_1][n_2]}) \le p(n_1, n_2)$. We say that $F$ is of
polynomial-expansion if there exists a polynomial $q(n)$ such that for all $f \in F$ and
$x \in \Omega^*$, $|f(x)| \le q(|x|)$. The following theorem will be used to prove a result in
Section 5 on the PAC-learnability of conformation rules. 

Theorem 1 ([7,8]). Let $F$ be a class of functions from $\Omega^*$ to $\Omega^*$ with a
representation $R$. $F$ is polynomial-time PAC-learnable in $R$ if the following hold: 

1. $F$ is of polynomial-dimension. 

2. $F$ is of polynomial-expansion. 

3. There exists a polynomial-time fitting for $F$ in $R$. 

3 Hypergraph Representation of a Protein 

Let $P$ be a protein with primary structure $A_1 A_2 \cdots A_n$, where $A_i$ represents 
the $i$-th amino acid residue. Its tertiary structure is usually represented by a 
sequence of the positions of amino acid residues in three-dimensional space 
as $(p_1, A_1), (p_2, A_2), \ldots, (p_n, A_n)$, where $p_i = (x_i, y_i, z_i)$ is the position of $A_i$ for 
$1 \le i \le n$. The distance between $p_i$ and $p_j$ is denoted by $|p_i - p_j|$. Let $\Sigma$ be the 
alphabet consisting of symbols representing the amino acid residues. 

Let $\mu > 0$ be a real number. For a protein $P$ with a tertiary structure $(p_1, A_1)$, 
$(p_2, A_2), \ldots, (p_n, A_n)$, let $G_P = (V, E)$ be an undirected graph defined as follows: 

1. $V = \{1, 2, \ldots, n\}$. 

2. For any distinct $i, j$ in $V$ with $|p_i - p_j| < \mu$, $\{i, j\}$ is in $E$. 

We call the undirected graph $G_P = (V, E)$ the structure graph of $P$ with $\mu$-range. 
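
As an illustration, the structure graph can be computed directly from the residue coordinates. The sketch below assumes the positions are given as a list of (x, y, z) tuples in sequence order and uses a strict comparison with the range parameter, as the condition reads above; the function name is hypothetical.

```python
import math

def structure_graph(positions, mu):
    """Edges {i, j} (1-based) for all pairs of residues closer than mu."""
    n = len(positions)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) < mu:
                edges.add(frozenset((i + 1, j + 1)))
    return edges
```

The double loop is quadratic in the sequence length, which is unproblematic for single protein chains of a few hundred residues.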

For positive integers $k$, $\omega$, $\tau$ and $G_P = (V, E)$, let $E_P^{\mathrm{complete}}$ be the set of
the hyperedges $e \subseteq V$ satisfying the following conditions: 

- $2 \le |e| \le k$, 

- $\max e - \min e + 1 \ge \tau$, that is, a restriction on the width of $e$ on the sequence 
$1, 2, \ldots, n$, 

- $G_P[e]$ is a complete graph, 

where $G_P[e]$ is the node-induced subgraph of $e$ in $G_P$. Let 

$E_P^{\mathrm{backbone}} = \{\{i, i+1, \ldots, j\} \mid j = i + \omega - 1,\ 1 \le i \le n - \omega + 1\}$, 

and $\psi : V \to \Sigma$ be a mapping defined by $\psi(i) = A_i$ for $1 \le i \le n$. Then a 
hypergraph $(V, E', \psi)$ with 

$E' = \mathrm{simplify}(E_P^{\mathrm{complete}}) \cup E_P^{\mathrm{backbone}}$ 

is a chain-hypergraph over $\Sigma$, which is called the hypergraph representation of 
$P$ over $\Sigma$ by complete graphs with $\mu$, $k$, $\omega$ and $\tau$. 
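
The construction by complete graphs can be sketched as follows. This is an exhaustive enumeration intended only for small inputs, under the assumption that the structure graph is given as a set of two-element frozensets over the nodes 1, ..., n; all names are illustrative.

```python
from itertools import combinations

def complete_hyperedges(n, graph_edges, k, tau):
    """Node sets e with 2 <= |e| <= k and width >= tau that induce a complete graph."""
    found = set()
    for size in range(2, k + 1):
        for e in combinations(range(1, n + 1), size):   # e is sorted
            if e[-1] - e[0] + 1 < tau:                  # width restriction
                continue
            if all(frozenset(p) in graph_edges for p in combinations(e, 2)):
                found.add(frozenset(e))
    return found

def backbone(n, omega):
    """Backbone hyperedges {i, ..., i+omega-1}."""
    return {frozenset(range(i, i + omega)) for i in range(1, n - omega + 2)}

def representation(n, graph_edges, k, omega, tau):
    """simplify(complete hyperedges) united with the backbone hyperedges."""
    cliques = complete_hyperedges(n, graph_edges, k, tau)
    maximal = {e for e in cliques if not any(e < f for f in cliques)}
    return maximal | backbone(n, omega)
```

Note that the simplification is applied to the complete-graph hyperedges only, so backbone hyperedges survive even when they are contained in a larger hyperedge.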

We say that an undirected graph $G = (V, E)$ where $V = \{v_0, v_1, \ldots, v_k\}$ 
and $E = \{\{v_0, v_i\} \mid v_i \in V, v_i \ne v_0\}$ is a star graph. Let $E_P^{\mathrm{star}}$ be the 
set of the hyperedges $e \subseteq V$ satisfying the following conditions: $2 \le |e| \le k$, 
$\max e - \min e + 1 \ge \tau$, and $G_P[e]$ is a star graph. Then a hypergraph
$(V, E'', \psi)$ with 

$E'' = \mathrm{simplify}(E_P^{\mathrm{star}}) \cup E_P^{\mathrm{backbone}}$ 

is a chain-hypergraph over $\Sigma$, which is called the hypergraph representation of 
$P$ over $\Sigma$ by star graphs with $\mu$, $k$, $\omega$ and $\tau$. 

Instead of representing amino acid residues explicitly, it is common
to classify the amino acid residues into several categories (e.g., [2,10,11]). 
In order to deal with such cases, we represent a protein in a more general 
way. Namely, we consider chain-hypergraphs whose nodes are labeled with some 
“colors”, which are not necessarily the same as the amino acid residues. Let $\Delta$ be 
an alphabet which consists of such “colors” labeling the nodes of hypergraphs. 
In this paper, we assume that the tertiary structure of a protein is represented 
by a chain-hypergraph over some alphabet in the way described above. 

4 Conformation Rules 

In this section, we define a conformation which transforms strings over $\Delta$ to 
chain-hypergraphs over $\Delta$. We denote the set of all chain-hypergraphs over $\Delta$ 
by $\mathcal{H}_\Delta$. 

Definition 3. A conformation over $\Delta$ is a function $c : \Delta^+ \to \mathcal{H}_\Delta$ such that 
$c(x) = (V, E, \psi)$ for a string $x = x_1 \cdots x_n \in \Delta^+$ satisfies $V = \{1, \ldots, n\}$ and 
$\psi(i) = x_i$ for $1 \le i \le n$. 

We give a way of completing a conformation by introducing conformation 
rules over $\Delta$, which are based on hypergraph rewriting rules defined as follows: 
A hypergraph rewriting rule over $\Delta$ is a triplet $\rho = (B, A, D)$ of a hypergraph 
$B = (V, E, \psi)$ over $\Delta$ and subsets $A$ and $D$ of $2^V$. The elements of $A$ and $D$ 
are called additional and removable hyperedges, respectively. The rank of $\rho$ is 
defined to be $\max\{r(B), \max\{|a| \mid a \in A\}\}$. The degree of $\rho$ is defined to be 
$d(B)$. 

Definition 4. Let $\rho_1 = (B_1, A_1, D_1)$ and $\rho_2 = (B_2, A_2, D_2)$ be hypergraph 
rewriting rules over $\Delta$ where $B_1 = (V_1, E_1, \psi_1)$ and $B_2 = (V_2, E_2, \psi_2)$. We say 
that $\rho_1$ is isomorphic to $\rho_2$, denoted by $\rho_1 \approx \rho_2$, if there is a bijection $\iota : V_1 \to V_2$ 
such that 

1. $\psi_1(v) = \psi_2(\iota(v))$ for all $v \in V_1$, 

2. $\iota(e_1) \in E_2$ for all $e_1 \in E_1$, and $\iota^{-1}(e_2) \in E_1$ for all $e_2 \in E_2$, 

3. $\iota(e_1) \in A_2$ for all $e_1 \in A_1$, and $\iota^{-1}(e_2) \in A_1$ for all $e_2 \in A_2$, 

4. $\iota(e_1) \in D_2$ for all $e_1 \in D_1$, and $\iota^{-1}(e_2) \in D_1$ for all $e_2 \in D_2$. 

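Definition 5. Let $D_\Delta$ be a set of hypergraph rewriting rules over $\Delta$. For positive 
integers $P$ and $Q$, we define a $(P \times Q)$-conformation rule $\sigma$ over $D_\Delta$ as 

$\sigma = (\beta_1, \beta_2, \ldots, \beta_P)$, 

where 

$\beta_p = (\gamma_{p,1}, \gamma_{p,2}, \ldots, \gamma_{p,Q})$ 

with 

$\gamma_{p,q} \subseteq D_\Delta$ 

for $1 \le p \le P$ and $1 \le q \le Q$. $\gamma_{p,q}$ is called the $(p,q)$-unit of $\sigma$, and $\beta_p$ is the 
$p$th unit-sequence of $\sigma$. $D_\Delta$ is the domain of $\sigma$. The rank of $D_\Delta$ is defined as 
$r(D_\Delta) = \max\{r(\rho) \mid \rho \in D_\Delta\}$, and the degree of $D_\Delta$ is
$d(D_\Delta) = \max\{d(\rho) \mid \rho \in D_\Delta\}$. The rank of $\sigma$ is
$\max\{r(\gamma_{p,q}) \mid 1 \le p \le P, 1 \le q \le Q\}$, and the 
degree of $\sigma$ is $\max\{d(\gamma_{p,q}) \mid 1 \le p \le P, 1 \le q \le Q\}$. 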

In this paper, we consider a rather limited class of hypergraph rewriting rules,
defined in the following way: 

Definition 6. A bundle rule over $\Delta$ is a hypergraph rewriting rule $\rho = (B, A, D)$ 
over $\Delta$ such that, for $B = (V, E, \psi)$ over $\Delta$, 

1. $|A| = 1$, say $A = \{U\}$. 

2. $|U| \ge 2$. 

3. $U \notin E$. 

4. For any hyperedge $e$ in $E$, $e \cap U \ne \emptyset$. 

5. $D = \{e \in E \mid e \subset U\}$. 

For short, we denote such a bundle rule $\rho = (B, A, D)$ by $(B, U)$. 
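
The five conditions of a bundle rule are easy to check mechanically. The sketch below assumes that E, A and D are Python sets of frozensets and ignores the labeling, which plays no role in these conditions; the function name is hypothetical.

```python
def is_bundle_rule(E, A, D):
    """Check conditions 1-5 of a bundle rule for hyperedge set E and sets A, D."""
    if len(A) != 1:                       # condition 1: A = {U}
        return False
    (U,) = A
    return (
        len(U) >= 2                       # condition 2
        and U not in E                    # condition 3
        and all(e & U for e in E)         # condition 4: every hyperedge meets U
        and D == {e for e in E if e < U}  # condition 5: D = hyperedges inside U
    )
```

Since U is not in E by condition 3, testing for proper subsets in condition 5 is the same as testing for subsets.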

We denote by $\Gamma_\Delta$ the set of all bundle rules over $\Delta$, and, for integers $k \ge 2$ 
and $d \ge 1$, by $\Gamma_{k,d,\Delta}$ the set of all bundle rules over $\Delta$ such that the rank is at 
most $k$ and the degree is at most $d$. 

Remark 1. Obviously, $\Gamma_\Delta$ is infinite. Note that $\Gamma_{k,d,\Delta}$ is finite if $\Delta$ is finite. On 
the other hand, $\bigcup_{k \ge 2} \Gamma_{k,d,\Delta}$ and $\bigcup_{d \ge 1} \Gamma_{k,d,\Delta}$ are infinite. 

We here describe a concrete conformation, which is a function transforming 
strings to hypergraphs by using conformation rules. Let $\sigma = (\beta_1, \beta_2, \ldots, \beta_P)$ 
be a $(P \times Q)$-conformation rule over $\Gamma_\Delta$ where $\beta_p = (\gamma_{p,1}, \gamma_{p,2}, \ldots, \gamma_{p,Q})$ and 
$\gamma_{p,q} \subseteq \Gamma_{k,d,\Delta}$ for $1 \le p \le P$ and $1 \le q \le Q$. We apply $\sigma$ to a string
$x = x_1 \cdots x_n$ in $\Delta^+$. For a positive integer $\omega$, we start with a rank $\omega$ linear
chain-hypergraph $H = (V, E, \psi)$, that is, $V = \{1, \ldots, n\}$, $\psi(i) = x_i$ for $1 \le i \le n$, and
$E = \{\{i, \ldots, i + \omega - 1\} \mid 1 \le i \le n - \omega + 1\}$. At the $p$th stage ($1 \le p \le P$), the 
$p$th unit-sequence $\beta_p$ of $\sigma$ is used in the following way. In each stage, a window 
on the node sequence $1, 2, \ldots, n$ corresponding to the string $x = x_1 \cdots x_n$ is an 
interval of the sequence, and it is enlarged step by step. The initial window 
size is specified by $\tau$. For each window size, the window is slid from left to 
right on $V$. Suppose a window of size $w$ ($\ge \tau$) at position $i$, that is, an interval 
$[i, \ldots, i + w - 1]$ consisting of $w$ consecutive nodes in $V$. Let $q = w - \tau + 1$, whose 
range is from 1 to $Q$. The bundle rules in the $(p, q)$-unit of $\sigma$ are applied to 
create new hyperedges $e$ such that $e$ consists only of nodes in $[i, \ldots, i + w - 1]$ and 
$i$ and $i + w - 1$ are in $e$. The creation of a new hyperedge $e$ in the window depends 
on the local structure around $e$ in the current hypergraph $H = (V, E, \psi)$. Namely, 
we consider the subhypergraph $H(e)$. A new hyperedge $e$ will be created if there 
is a bundle rule $(B, U) \in \gamma_{p,q}$ which is isomorphic to $(H(e), e)$. After creating 
all new hyperedges in the process of sliding the window from left to right, these 
hyperedges are added to $E$ and the proper subsets of them are deleted from 
$E$, and this window sliding process is repeated after the window is enlarged. A 
formal description is given in Fig. 1. 



Input: a $(P \times Q)$-conformation rule $\sigma = (\beta_1, \ldots, \beta_P)$ over $\Gamma_\Delta$ where 
       $\beta_p = (\gamma_{p,1}, \gamma_{p,2}, \ldots, \gamma_{p,Q})$ and $\gamma_{p,q} \subseteq \Gamma_\Delta$ for $1 \le p \le P$ and $1 \le q \le Q$, 
       positive integers $\omega$ and $\tau$ with $\omega \le \tau$, and a string $x = x_1 \cdots x_n$ in $\Delta^+$ 
Output: a hypergraph $H = (V, E, \psi)$ 

Procedure: CONFORM($\omega$, $\tau$, $\sigma$, $x$) 
    let $k$ be the rank of $\sigma$ 
    $V = \{1, \ldots, n\}$ 
    let $\psi$ be a mapping defined by $\psi(i) = x_i$ for $1 \le i \le n$ 
    $E = \{\{i, \ldots, i + \omega - 1\} \mid 1 \le i \le n - \omega + 1\}$ 
    $H = (V, E, \psi)$    # linear chain-hypergraph of rank $\omega$ 
    for $p$ from 1 to $P$: 
        for $q$ from 1 to $\min\{Q, n\}$: 
            $w = \tau + q - 1$    # $w$ is the window size 
            $A = \emptyset$; $D = \emptyset$ 
            foreach $i$ from 1 to $n - w + 1$: 
                $j = i + w - 1$ 
                foreach $e \subseteq \{i, \ldots, j\}$ such that $i, j \in e$ and $|e| \le k$: 
                    if a bundle rule $(H(e), e) \approx \rho$ for some $\rho$ in $\gamma_{p,q}$: 
                        add $e$ to $A$ 
                        add the proper subsets of $e$ in $E$ to $D$ 
            $E = E \cup A \setminus D$ 

Fig. 1. Algorithm CONFORM 
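
The order in which the procedure of Fig. 1 visits candidate hyperedges within one stage can be sketched in Python. The generator below covers only the enumeration of windows and candidates, not the isomorphism test against the bundle rules; it assumes a window size of at least 2, and its name and signature are illustrative.

```python
from itertools import combinations

def window_candidates(n, tau, Q, k):
    """Yield (q, i, e) in the order candidate hyperedges are visited in one stage."""
    for q in range(1, min(Q, n) + 1):
        w = tau + q - 1                      # window size
        for i in range(1, n - w + 2):        # window start, 1-based
            j = i + w - 1                    # window end
            inner = range(i + 1, j)          # nodes strictly inside the window
            for extra in range(0, k - 1):    # |e| = extra + 2 <= k
                for mid in combinations(inner, extra):
                    yield q, i, frozenset((i, j, *mid))
```

Each candidate contains both end points of its window, so every candidate is produced exactly once: its window is determined by its smallest and largest node. The number of candidates per window grows like the number of subsets of size at most k - 2 of the window interior, which is polynomial in the window size for constant k.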



The graph G given in Fig. 2 is an example of the graphs which cannot be 
generated by any (1, Q)-conformation rule for any Q. 

The following proposition is obvious by definitions: 

Fig. 2. $\sigma = ((\gamma_{1,1}), (\gamma_{2,1}), (\gamma_{3,1}), (\gamma_{4,1}))$, which is a $(4 \times 1)$-conformation rule over 
$\Gamma_{4,4,\{0,1\}}$ generating the graph $G$ 



Proposition 1. Let $\sigma$ be a $(P \times Q)$-conformation rule over $\Gamma_{k,d,\Delta}$, $\omega$ 
and $\tau$ be positive integers with $\omega \le \tau$, and $x \in \Delta^+$. The hypergraph 
CONFORM($\omega$, $\tau$, $\sigma$, $x$) given in Fig. 1 is a chain-hypergraph over $\Delta$ of rank at 
most $k$. 

Definition 7. For a $(P \times Q)$-conformation rule $\sigma$ over $\Gamma_\Delta$ and positive integers 
$\omega$ and $\tau$ with $\omega \le \tau$, we define a conformation $c_\sigma^{\omega,\tau}$ as a function from $\Delta^+$ to 
the set of chain-hypergraphs over $\Delta$, by $c_\sigma^{\omega,\tau}(x) =$ CONFORM($\omega$, $\tau$, $\sigma$, $x$) for 
$x \in \Delta^+$. 

5 PAC-Learning of Conformation 

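For a positive integer $n$, let $\mathcal{H}_\Delta^{[n]}$ be the set of all chain-hypergraphs over $\Delta$ with 
at most $n$ nodes. By $c_\sigma^{\omega,\tau,[n]}$ we denote the function $\Delta^{[n]} \to \mathcal{H}_\Delta^{[n]}$ obtained 
by restricting $c_\sigma^{\omega,\tau}$ to $\Delta^{[n]}$. 

For integers $\omega \ge 2$, $\tau \ge \omega$, $P, Q \ge 1$, and an alphabet $\Delta$, let 

$\mathcal{C}_\Delta^{\omega,\tau,P,Q} = \{ c_\sigma^{\omega,\tau} \mid \sigma \text{ is a } (P \times Q)\text{-conformation rule over } \Gamma_\Delta \}$. 

As noted in Remark 1, the alphabet $\Gamma_\Delta$ is infinite even if $\Delta$ is finite. This causes 
trouble in discussing the PAC-learnability of a class of conformations. However, 
if we restrict the rank and degree of conformation rules to constant integers $k$ 
and $d$, respectively, the alphabet $\Gamma_{k,d,\Delta}$ is finite for finite alphabets $\Delta$. Let 

$\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q} = \{ c_\sigma^{\omega,\tau} \mid \sigma \text{ is a } (P \times Q)\text{-conformation rule over } \Gamma_{k,d,\Delta} \}$ 

for integers $k \ge 2$, $d \ge 1$, $\omega \ge 2$, $\tau \ge \omega$, $P \ge 1$ and $Q \ge 1$. 

Our main result is the following theorem: 

Theorem 2. The class $\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q}$ is polynomial-time PAC-learnable. 

Theorem 3. The class $\bigcup_{Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q}$ is polynomial-time PAC-learnable. 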

We can prove these theorems by showing that these classes satisfy the three 
conditions in Theorem 1. 

For an integer $k \ge 2$, a hypergraph $H = (V, E, \psi)$ of rank $k$ with $n = |V|$ 
can be expressed under an appropriate encoding as a string over $\Delta$ whose length 
is polynomially bounded with respect to $n$. Thus we regard a conformation $c$ 
over $\Delta$ as a function from $\Delta^+$ to strings over $\Delta$. Therefore we can see that any
class of conformations over $\Delta$ is of polynomial-expansion. 

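Next we show that $\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q}$ and $\bigcup_{Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q}$ are of polynomial-dimension. 
Let 

$\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q,[n]} = \{ c_\sigma^{\omega,\tau,[n]} \mid \sigma \text{ is a } (P \times Q)\text{-conformation rule over } \Gamma_{k,d,\Delta} \}$. 

By Lemma 1, it suffices to show that $|\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q,[n]}|$ and $|\bigcup_{Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q,[n]}|$ are 
bounded by $2^{p(n)}$ for some polynomial $p(n)$. A $(P \times Q)$-conformation rule $\sigma$ 
over $\Gamma_{k,d,\Delta}$ can be considered as a $P \times Q$ matrix whose elements are subsets 
of $\Gamma_{k,d,\Delta}$. Since $|\Gamma_{k,d,\Delta}|$ is a finite constant, say $s$, we have
$|\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q,[n]}| \le (2^s)^{PQ}$, that is, $|\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q,[n]}|$ is also bounded by a finite
constant. It should be noted here that
$\bigcup_{Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q,[n]} = \bigcup_{n \ge Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q,[n]}$. Thus, we can see that 
the cardinality of this union is bounded by $2^{p(n)}$ for some polynomial 
$p(n)$. 

Finally we discuss polynomial-time fittings for $\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q}$ and $\bigcup_{Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q}$. It 
is trivial that there is a polynomial-time fitting for $\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q}$ since the cardinality 
of the class is a finite constant. 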

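We then describe a polynomial-time fitting $B$ for $\bigcup_{Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q}$ by
employing the algorithm EXTRACT given in Fig. 3. Given chain-hypergraphs 
$H_1 = (V_1, E_1, \psi_1), \ldots, H_t = (V_t, E_t, \psi_t)$ over $\Delta$ and positive integers $\omega$ and $\tau$ 
with $\tau \ge \omega$, the algorithm $B$ computes, for $1 \le h \le t$, a conformation rule over 
$\Gamma_\Delta$, $\sigma_h =$ EXTRACT($\omega$, $\tau$, $N$, $H_h$), where $N = \max_{1 \le h \le t} |V_h|$. We denote by 
$\gamma_{1,q}^{h}$ the $(1,q)$-unit of $\sigma_h$ for $1 \le q \le N$. For each $q$ with $1 \le q \le N$, let
$\hat{\gamma}_{1,q} = \bigcup_{1 \le h \le t} \gamma_{1,q}^{h}$, and $\hat{\sigma} = ((\hat{\gamma}_{1,1}, \hat{\gamma}_{1,2}, \ldots, \hat{\gamma}_{1,N}))$. The algorithm $B$ outputs $\hat{\sigma}$ from 
$H_1, \ldots, H_t$. Obviously, $B$ runs in polynomial time since the rank of conformation 
rules is a constant $k$. 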

If $H_1 =$ CONFORM($\omega$, $\tau$, $\sigma$, $s_1$), $H_2 =$ CONFORM($\omega$, $\tau$, $\sigma$, $s_2$), $\ldots$, 
$H_t =$ CONFORM($\omega$, $\tau$, $\sigma$, $s_t$) for some $(1, Q)$-conformation rule $\sigma$ 
over $\Gamma_{k,d,\Delta}$ and strings $s_1, s_2, \ldots, s_t \in \Delta^+$, then we can show that 
$H_h =$ CONFORM($\omega$, $\tau$, $\hat{\sigma}$, $s_h$) for $1 \le h \le t$, which means that 
CONFORM($\omega$, $\tau$, $\hat{\sigma}$, $\cdot$) is consistent with the examples $\{(s_i, H_i) \mid 1 \le i \le t\}$. 
For $1 \le h \le t$ and $1 \le q \le N$, let 

- $C_{h,q}$ be the contents of $E$ just after the $q$th iteration of the for-loop on $q$ of 
the 1st iteration of the for-loop on $p$ of CONFORM($\omega$, $\tau$, $\sigma$, $s_h$) has been 
finished if $q \le \min\{Q, |s_h|\}$, $C_{h,q} = C_{h,q-1}$ otherwise. 

Input: a chain-hypergraph $H = (V, E, \psi)$ over $\Delta$ of rank $k$, and 
       positive integers $\omega$, $\tau$ and $R$ 
Output: a conformation rule $\sigma = (\beta_1)$ over $\Gamma_\Delta$ of rank $k$ where 
       $\beta_1 = (\gamma_{1,1}, \gamma_{1,2}, \ldots, \gamma_{1,R})$ with $\gamma_{1,q} \subseteq \Gamma_\Delta$ for $1 \le q \le R$ 

Procedure: EXTRACT($\omega$, $\tau$, $R$, $H$) 
    $n = |V|$ 
    $E' = \{\{i, \ldots, i + \omega - 1\} \mid 1 \le i \le n - \omega + 1\}$ 
    $H' = (V, E', \psi)$ 
    for $q$ from 1 to $R$: 
        $w = \tau + q - 1$ 
        $A = \emptyset$ 
        $D = \emptyset$ 
        foreach $i$ from 1 to $n - w + 1$: 
            $j = i + w - 1$ 
            foreach $U \subseteq \{i, \ldots, j\}$ such that $i, j \in U$ and $|U| \le k$: 
                if $U \in E$: 
                    $\rho = (H'(U), U)$ 
                    add $\rho$ to $\gamma_{1,q}$ 
                    add $U$ to $A$ 
                    add the proper subsets of $U$ in $E'$ to $D$ 
        $E' = E' \cup A \setminus D$ 

Fig. 3. Algorithm EXTRACT 



- $E_{h,q}$ be the contents of the working hyperedge set of
EXTRACT($\omega$, $\tau$, $N$, $H_h$) just after the $q$th iteration of its for-loop has been finished. 

- $\hat{C}_{h,q}$ be the contents of $E$ just after the $q$th iteration of the for-loop on $q$ of 
the 1st iteration of the for-loop on $p$ of CONFORM($\omega$, $\tau$, $\hat{\sigma}$, $s_h$) has been 
finished if $q \le |s_h|$, $\hat{C}_{h,q} = \hat{C}_{h,q-1}$ otherwise. 

For convenience, let $C_{h,0} = E_{h,0} = \hat{C}_{h,0} = \{\{i, \ldots, i + \omega - 1\} \mid 1 \le i \le |s_h| - \omega + 1\}$ 
for $1 \le h \le t$, which are the initial hyperedges in the algorithms CONFORM and 
EXTRACT on the string $s_h$. It is not hard to prove by induction on $q$ that 

$C_{h,q} = E_{h,q} = \hat{C}_{h,q}$ 

for $1 \le h \le t$. This completes the proof. 

The following theorem can be shown in a similar way and we omit its proof. 

Theorem 4. The class E polynomial-time PAC-learnable. 

6 Refutably PAC-Learning Functions 

In this section, we introduce the refutability of PAC-learning algorithms on
functions. The refutability of PAC-learning algorithms on concepts has already 
been discussed in [5,6]. PAC-learning algorithms having the ability to refute classes 
which do not seem to include a target function would be helpful in dealing with 
real data. 

Let $f$ be a function from $\Omega^*$ to $\Omega^*$, $F$ be a class of functions from $\Omega^*$ to $\Omega^*$, 
and $P$ be a probability distribution on $\Omega^*$. 

We define $\mathrm{opt}_f(P, F)$ by 

$\mathrm{opt}_f(P, F) = \min_{f' \in F} P(f' \triangle f)$, 

where $f' \triangle f$ denotes the set of strings on which $f'$ and $f$ differ. 

We can see that if $f \in F$ then $\mathrm{opt}_f(P, F) = 0$ for any $P$. 
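
For a distribution with finite support the quantity opt can be computed directly. The sketch below assumes that the distribution is given as a dict from strings to probabilities and that the class F is a finite list of Python callables; both the representation and the names are assumptions made for illustration.

```python
def disagreement(f, g, P):
    """Probability mass of the inputs on which f and g differ."""
    return sum(p for x, p in P.items() if f(x) != g(x))

def opt(f, P, F):
    """Smallest disagreement between the target f and any member of F."""
    return min(disagreement(f, g, P) for g in F)
```

If the target itself belongs to F, the minimum is attained by the target and the value is 0, in line with the observation above.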

Definition 8. Let $F$ be a class of functions from $\Omega^*$ to $\Omega^*$. A function class 
$F$ is polynomial-sample refutably learnable if there exist an algorithm $A$ and a 
polynomial $p(\cdot, \cdot, \cdot, \cdot)$ which satisfy the following conditions: 

1. The algorithm $A$ takes as input parameters $\varepsilon, \varepsilon', \delta \in (0, 1)$ and $n \ge 1$. We 
call $\varepsilon'$ a refutation accuracy parameter. 

2. Let $f$ be a target function from $\Omega^*$ to $\Omega^*$ and $P$ an arbitrary and
unknown probability distribution on $\Omega^*$. The algorithm $A$ takes a sample of 
size $p(1/\varepsilon, 1/\varepsilon', 1/\delta, n)$ using a subroutine $EX(f, P)$, which at each call
produces a single example for $f$ according to $P$. 

3. If $\mathrm{opt}_f(P, F) = 0$ then $A$ outputs a function $g \in F$ which satisfies $P(f \triangle g) < \varepsilon$ 
with probability at least $1 - \delta$. If $\mathrm{opt}_f(P, F) > \varepsilon'$ then $A$ refutes the function 
class $F$ with probability at least $1 - \delta$. 

Theorem 5. If a class $F$ of functions is of polynomial dimension, then $F$ is 
polynomial-sample refutably learnable. 

By this theorem the following holds: 

Corollary 1. The classes $\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q}$ and $\bigcup_{Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q}$ are polynomial-sample 
refutably learnable. 

Since $F$ is of polynomial dimension, there exists a polynomial $\mathrm{poly}(\cdot, \cdot)$ such 
that $\log_2 |F^{[n_1][n_2]}| \le \mathrm{poly}(n_1, n_2)$ for any $n_1, n_2 \ge 1$. We construct the algorithm 
described in Figure 4. 

We introduce a refutation threshold parameter $\eta \in (0, 1)$ so that a learning 
algorithm produces an approximate function instead of refuting $F$ when the 
minimum error $\mathrm{opt}_f(P, F)$ is small enough. 

Definition 9. Let $F$ be a class of functions from $\Omega^*$ to $\Omega^*$. A function class 
$F$ is polynomial-sample strongly refutably learnable if there exist an algorithm $A$ 
and a polynomial $p(\cdot, \cdot, \cdot, \cdot)$ which satisfy the following conditions: 

1. The algorithm $A$ takes as input parameters $\varepsilon, \varepsilon', \delta, \eta \in (0, 1)$ and $n \ge 1$. 

2. Let $f$ be a target function from $\Omega^*$ to $\Omega^*$ and $P$ an arbitrary and
unknown probability distribution on $\Omega^*$. The algorithm $A$ takes a sample of 
size $p(1/\varepsilon, 1/\varepsilon', 1/\delta, n)$ using a subroutine $EX(f, P)$, which at each call
produces a single example for $f$ according to $P$. 

Input: $\varepsilon, \varepsilon', \delta, n_1, n_2$ 

Procedure: 
    let $m = \lceil (1/\varepsilon + 1/\varepsilon')(1/\delta + \mathrm{poly}(n_1, n_2)) \rceil$ 
    make $m$ calls of $EX$ 
    let $S$ be the set of examples seen 
    if there is a function $g \in F$ consistent with $S$: 
        return $g$ 
    else 
        refute $F$ 

Fig. 4. Refutable algorithm $A_{\mathrm{RefuteBySampleComplexity}}(\varepsilon, \varepsilon', \delta, n_1, n_2)$ 
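
The procedure of Fig. 4 can be sketched in Python for a class that is small enough to enumerate. The sketch assumes that F is a finite list of callables, that `draw_example` is a hypothetical stand-in for the subroutine EX returning one (input, output) pair per call, and that `poly_value` is the already evaluated polynomial bound; refutation is signalled by returning None.

```python
import math

def refute_by_sample_complexity(F, draw_example, eps, eps_ref, delta, poly_value):
    """Return a member of F consistent with the sample, or None to refute F."""
    m = math.ceil((1 / eps + 1 / eps_ref) * (1 / delta + poly_value))
    sample = [draw_example() for _ in range(m)]
    for g in F:
        if all(g(x) == y for x, y in sample):
            return g
    return None
```

The search over F is exhaustive here, so the sketch illustrates only the sample-size argument and says nothing about running time for large classes.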



3. If $\mathrm{opt}_f(P, F) < \eta$ then $A$ outputs a function $g \in F$ which satisfies
$P(f \triangle g) < \eta + \varepsilon$ with probability at least $1 - \delta$. If $\mathrm{opt}_f(P, F) > \eta + \varepsilon'$ then $A$ refutes the 
function class $F$ with probability at least $1 - \delta$. 

Theorem 6. If a class $F$ of functions is of polynomial dimension, then $F$ is 
polynomial-sample strongly refutably learnable. 

Corollary 2. The classes $\mathcal{C}_{k,d,\Delta}^{\omega,\tau,P,Q}$ and $\bigcup_{Q \ge 1} \mathcal{C}_{k,d,\Delta}^{\omega,\tau,1,Q}$ are polynomial-sample 
strongly refutably learnable. 

We construct the algorithm described in Figure 5. We denote by d{g, S) the 
number of examples in S with which g does not agree. 



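Input: $\varepsilon, \varepsilon', \delta, \eta, n_1, n_2$ 

Procedure: 
    $\kappa = \min\{\varepsilon, \varepsilon'\}$ 
    $m = \lceil 4 (1/\varepsilon^2 + 1/\varepsilon'^2)(1/\delta + \mathrm{poly}(n_1, n_2)) \rceil$ 
    make $m$ calls of $EX$ 
    let $S$ be the set of examples seen 
    if there is a function $g \in F$ with $d(g, S) \le \lfloor m (\eta + (1/2)\kappa) \rfloor$ then 
        return $g$ 
    else 
        refute $F$ 

Fig. 5. Strongly refutable algorithm $A_{\mathrm{StronglyRefuteBySampleComplexity}}(\varepsilon, \varepsilon', \delta, \eta, n_1, n_2)$ 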
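
The thresholded variant of Fig. 5 differs from the previous procedure only in the sample size and in tolerating a bounded number of disagreements. The following sketch makes the same assumptions as before: F is a finite list of callables, `draw_example` is a hypothetical stand-in for EX, `poly_value` is the evaluated polynomial bound, and None stands for refutation.

```python
import math

def strongly_refute(F, draw_example, eps, eps_ref, delta, eta, poly_value):
    """Return g in F with few sample disagreements, or None to refute F."""
    kappa = min(eps, eps_ref)
    m = math.ceil(4 * (1 / eps**2 + 1 / eps_ref**2) * (1 / delta + poly_value))
    sample = [draw_example() for _ in range(m)]
    budget = math.floor(m * (eta + kappa / 2))   # allowed number of disagreements
    for g in F:
        errors = sum(1 for x, y in sample if g(x) != y)
        if errors <= budget:
            return g
    return None
```

The budget grows with the refutation threshold, so a class whose best member errs slightly is still accepted instead of being refuted.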



We can easily see that F is of polynomial dimension if F is polynomial-sample 
refutably learnable or polynomial-sample strongly refutably learnable. Therefore 
the following three statements are equivalent: 

1. $F$ is of polynomial dimension. 

2. $F$ is polynomial-sample refutably learnable. 

3. $F$ is polynomial-sample strongly refutably learnable. 

7 Experiments 

In this section, we report our preliminary computational experiments on learning 
conformation rules from hypergraphs representing tertiary structures of
proteins. We have implemented the PAC-learning algorithm given by the
algorithms CONFORM($\omega$, $\tau$, $\sigma$, $x$) and EXTRACT($\omega$, $\tau$, $R$, $H$) in the Python
language [13]. 

7.1 Method of Experiments 

The hypergraph representation of a protein over $\Delta$ by star graphs is used 
with $\mu$, $k$, $\omega$, $\tau$ specified as follows: $\mu = 5.8$ Å, $k = 10$, $\omega = 5$ and $\tau = 8$. The 
choice of the alphabet $\Delta$ for labeling the nodes of a hypergraph is one of the 
keys to the experiments. The alphabet $\Delta$ represents a classification of amino acid 
residues. Hart and Istrail [3] used the hydrophobic-hydrophilic model, 
which regards a protein as a linear chain of amino acid residues of two types, 
H (hydrophobic) and P (hydrophilic). However, some amino acids are neither 
hydrophobic nor hydrophilic. In our experiments, $\Delta$ is set to {H, P, N}, where 
the amino acid residues are assigned as follows: 

H : ALA, CYS, ILE, LEU, MET, PHE, TRP, VAL, 

P : ARG, ASN, ASP, GLN, GLU, LYS, PRO, ASX, GLX, 

N : GLY, HIS, SER, THR, TYR. 
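
The classification can be written down as a lookup table that turns a residue sequence into a string over {H, P, N}. This is a minimal sketch, not the code used in the experiments; it assumes the standard three-letter code ARG for arginine, and the names `CLASSES`, `COLOR` and `colorize` are chosen here for illustration.

```python
# Residue classes as listed above; ARG is assumed for arginine.
CLASSES = {
    "H": "ALA CYS ILE LEU MET PHE TRP VAL",
    "P": "ARG ASN ASP GLN GLU LYS PRO ASX GLX",
    "N": "GLY HIS SER THR TYR",
}

# Invert the table: three-letter residue code -> color.
COLOR = {res: color for color, names in CLASSES.items() for res in names.split()}

def colorize(residues):
    """Map a sequence of three-letter residue codes to a string over {H, P, N}."""
    return "".join(COLOR[r] for r in residues)
```

A residue code outside the table raises a KeyError, which makes unexpected entries in a data file visible instead of silently mislabeling them.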

The class of conformations with $P = 1$ and $Q = 2$ is considered in the
experiments (the degree bound $d$ is left unlimited, since it matters less than the
rank bound $k$). Given examples $(s_1, H_1), \ldots, (s_t, H_t)$, the polynomial-time
fitting $B$, used to prove Theorem 3, outputs a $(1, 2)$-conformation rule $\hat{\sigma}$, 
which is applied in CONFORM($\omega$, $\tau$, $\hat{\sigma}$, $x$) for a sequence $x$. 

To evaluate how similar a hypergraph predicted by CONFORM is to the 
target hypergraph, we compare them hyperedge by hyperedge. To this end, we 
define a similarity between hyperedges as follows: Let $g \ge 0$ and $0 \le \kappa \le 1$, and let
$E_1$ and $E_2$ be subsets of $2^V$, where $V = \{1, 2, \ldots, n\}$. For $e_1 \in E_1$ and $e_2 \in E_2$, we 
say that $e_1$ is $(g, \kappa)$-similar to $e_2$ if $\min e_2 - g \le \min e_1 \le \min e_2 + g$ and
$e_1 \triangle e_2 \le \kappa$. We denote
$\mathrm{Sim}_{g,\kappa}(E_1, E_2) = |\{e_1 \in E_1 \mid e_1 \text{ is } (g, \kappa)\text{-similar to some } e_2 \in E_2\}|$. 

TIM-barrel proteins have highly regular conformations, which are composed 
of eight parallel $\beta$-sheets forming a barrel structure [12]. We downloaded PDB 
files of TIM-barrel proteins from the site of PDB [14] and screened them. 
Fifteen proteins remain, whose tertiary structures are fully determined and 
which are composed of a single chain of amino acids. 

In our experiments, the following small modification has been made: for a 
bundle rule $\rho = (B, A, D)$ where $A = \{U\}$, $D$ is set to $\emptyset$ instead of
$\{e \in E \mid e \subset U\}$, which changes nothing else but would make it possible to
obtain more detailed conformation rules. 

7.2 Evaluation 

We carried out two kinds of experiments. One is self-conformation: for
a single protein p, a (1, 2)-conformation rule a is learned from the hypergraph
representation of p and used in CONFORM with the sequence of p. The other
is the case where a (1, 2)-conformation rule a is extracted from 14 TIM-barrel
proteins and applied to the remaining one.

In self-conformation, successful results are obtained. Let $H_T = (V, E_T, \psi)$
and $H_P = (V, E_P, \psi)$ be a target and a predicted hypergraph, respectively. For
a set S, we denote the complement of S by $\overline{S}$. A typical result of the
self-conformation test is given in Tab. 1. Since the experiment goes well with
window sizes 7 and 8, it should be continued with window sizes over 8. However,
the procedure then does not finish in a practical time, because hypergraph
matching is performed repeatedly in our procedure. Developing an efficient
and practical algorithm for hypergraph isomorphism is one item of future work.



Table 1. Result of self-conformation with protein 4ALD, whose sequence is of length
363. The backbone hyperedges are excluded.

window size   $E_P \cap E_T$   $E_P \cap \overline{E_T}$   $\overline{E_P} \cap E_T$
7             69               0                           0
8             14               0                           0


Tab. 2 shows the result of conformation of protein 4ALD obtained by applying
a (1, 2)-conformation rule learned from the other 14 TIM-barrel proteins. In
the stage of window size 7, 23 (= 6 + 17) hyperedges are added, 6 of
which are similar or exactly identical to hyperedges in the target $H_T$. However,
the remaining 17 hyperedges are wrong, that is, there are no similar hyperedges
to them in $H_T$. An interesting observation is that correct hyperedge additions
often occur in a neighborhood, which would imply that the conformation rule
causing them captures some regional property common to several proteins.
In the stage of window size 8, no hyperedge is added. This is because, once a
wrong hyperedge is added, it makes it difficult to add correct hyperedges in the
following stages with larger window sizes. Settling this problem is also left as
future work.



Table 2. Result of conformation of protein 4ALD, applying a (1, 2)-conformation rule
learned from the other 14 TIM-barrel proteins.

window size   $Sim_{2,0.5}(E_P, E_T)$   $Sim_{2,0.5}(E_T, E_P)$   $E_P \cap \overline{E_T}$   $\overline{E_P} \cap E_T$
7             6                         9                         17                          63
8             0                         0                         0                           14

8 Concluding Remarks

In this paper, we formulated the protein conformation problem as the PAC-
learning problem of hypergraph rewriting rules from hypergraphs. Since our
graph-theoretic approach to the protein conformation problem is quite
distinctive, this learning problem deserves extensive study, with appropriate
modifications to the framework proposed here, although the current results of
our preliminary computational experiments are far from satisfactory.



Abstract. Answering novel questions that nobody has ever considered is a
current challenge. Such a question is too rare to be answered by a single
existing document. In this paper, we propose a new framework of knowledge
navigation that graphically presents multiple documents relevant to a user's
question. Our implemented system, named MACLOD, generates several
navigational plans, each forming a complementary document-set rather than
a single document, for guiding a user toward understanding a novel question.
The obtained plans are mapped onto a 2-dimensional interface in which the
documents of each document-set are connected by links, to help the user
select a plan smoothly. In experiments, the method obtained satisfactory
answers to users' unique questions.



1 Introduction 

Answering a user's novel question that nobody has ever asked is a current
challenge. Such a question is too new to be answered by a single existing
document, and the knowledge required for understanding the documents relevant
to a user's question depends on his/her background [4]. In our previous work [3],
we proposed a novel information retrieval method named combination retrieval
for creating novel knowledge by combining complementary documents. Here, a
complementary set of documents is a set of documents whose combination
supplies satisfactory information. This idea is based on the principle that
combining ideas can trigger the creation of new ideas [1,2]. Through that work,
we verified that reading multiple complementary documents generates synergy
effects which help us acquire novel knowledge.

In this paper, we propose a new framework of knowledge navigation, i.e.,
of supplying a user with new knowledge, which satisfies the user's information
request by visualizing complementary documents. Our implemented system,
named MACLOD (MAp of Complementary Links Of Documents), generates several
navigational plans, each formed by a document-set for navigating a user to
understand a novel question, by making use of combination retrieval [3]. The
obtained plans are mapped onto a 2-dimensional interface in which the documents
of each document-set are connected by links, to help the user select
complementary documents smoothly.

K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 258-270, 2001.
© Springer-Verlag Berlin Heidelberg 2001

The remainder of this paper is organized as follows. In Section 2, the
significance of our approach is shown by comparison with previous knowledge
navigation methods. The mechanism of combination retrieval is described in
Section 3, and the mechanism of MACLOD, as implemented here, is described in
Section 4. Section 5 presents the experiments and their results, showing the
performance of MACLOD on question-answer documents from medical counseling.



2 Previous Methods for Knowledge Navigation 

The vision of knowledge navigation was presented by John Sculley (then the
president of Apple Computer Inc.), in which an electronic secretary in a computer,
named Knowledge Navigator, managed various tasks on behalf of users, e.g.,
managing schedules. The concept inspired us. However, it is still difficult to
realize the Knowledge Navigator because of the complexity of a real secretary's tasks.

A knowledge navigation system is a piece of software which answers
a user's question. The question may be entered as a word-set query
{alcohol, liver, cancer} or as a sentence query such as “Does alcohol cause a
liver cancer?” An intelligent answer to this question may be “No, alcohol does
not cause liver cancer directly. You may be confusing liver cancer with other
liver damage from alcohol. Alcohol causes cancer in other tissues.” To give such
an answer, the system must have medical knowledge relevant to the user's query
and reason over that knowledge to answer the question. However, it is not
realistic to implement knowledge broad enough to cover each user's unique
interests.

Another approach to navigating knowledge is to retrieve ready-made
documents relevant to the current query from a prepared document collection. In
this way, we can skip the process of knowledge acquisition and implementation,
because human-written documents represent complex human knowledge directly.
A search engine for a word-set query entered by the user may be the simplest
realization of this approach. However, as noted in Section 1, existing information
retrieval methods, which try to answer a query with ONE of the output documents,
cannot satisfy novel interests.



3 The Process of Combination Retrieval 

Combination retrieval [3] is a method for selecting meaningful documents which,
as a set, provide a good (readable and satisfactory) answer to the user. In this
section, we review the algorithm of combination retrieval.

3.1 The Outline of the Process

The process of combination retrieval is as follows: 

The Process of Combination Retrieval 
Step 1) Accept the user's query $Q_g$.

Step 2) Obtain G, a word-set representing the goal the user wants to understand,
from $Q_g$ (G = $Q_g$ if $Q_g$ is given simply as a word-set).

Step 3) Make the knowledge-base $\Sigma$ for the abduction of Step 4). For each
document $D_x$ in the document-collection $C_{doc}$, a Horn clause is made to
describe the condition (the words that must be understood for reading $D_x$) and
the effect (the words to be subsequently understood by reading $D_x$).

Step 4) Obtain h, the optimal hypothesis-set which derives G if combined with
$\Sigma$, by cost-based abduction (detailed later). The h obtained here represents
the union of the following information, with the least size of K.

S: The document-set the user should read.

K: The keyword-set the user should understand for reading the documents
in S.

Step 5) Show the documents in S to the user.

The intuitive meaning of employing abductive inference is to obtain the
conditions for understanding the user's goal G. Here, the conditions include the
documents to read (S) for understanding G, and the knowledge (K) necessary for
reading those documents. That is, S is the combination of documents to be
presented to the user.

3.2 The Details of Combination Retrieval’s Process 

In preparation, the collection $C_{doc}$ of existing human-made documents is
stored. Key, the set of keyword candidates in the documents in $C_{doc}$, i.e., the
word-set which is the union of the keywords extracted from all the documents in
$C_{doc}$, is obtained and fixed. Here, words are stemmed as in [5] and stop words
(“does”, “is”, “a”, ...) are deleted, and then a constant number of words with the
highest TFIDF values [6] (using $C_{doc}$ as the corpus for computing the document
frequencies of words) are extracted as keywords from each document in $C_{doc}$.
Next, let us go into the details of each step in Subsection 3.1.

Step 1) to 2) Make goal G from the user's query $Q_g$: Goal G is defined as
the set of words in $Q_g \cap Key$, i.e., the keywords in the user's query. For
example, the question “does alcohol make me warm?” and the query {alcohol, warm}
are both put into the same goal {alcohol, warm}, if $C_{doc}$ is a set of past
question-answer pairs of a medical counselor for which Key does not contain
“does”, “make”, “me”, “in”, “a”, or “day” (some are deleted as stop words).
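
As a rough illustration of the preparation of Key and of Steps 1) and 2), the following Python sketch builds Key from a toy collection and forms the goal G = $Q_g \cap Key$. It is a sketch under stated assumptions, not the authors' implementation: the stop-word list, the `per_doc` cut-off, the plain tf × log(N/df) weighting, and all function names are ours, and stemming [5] is omitted for brevity.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption; the paper does not list its own).
STOP_WORDS = {"does", "is", "a", "the", "me", "make", "in", "of", "and", "to"}


def tokenize(text):
    """Lower-case the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_key(docs, per_doc=3):
    """Return Key: the union of the `per_doc` highest-TFIDF words of each document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequencies
    n = len(docs)
    key = set()
    for toks in tokenized:
        tf = Counter(toks)
        scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
        # Sort by decreasing score; break ties alphabetically for determinism.
        key.update(sorted(scores, key=lambda w: (-scores[w], w))[:per_doc])
    return key


def goal(query, key):
    """G = Q_g ∩ Key; the query may be a sentence or an iterable of words."""
    words = tokenize(query) if isinstance(query, str) else [w.lower() for w in query]
    return set(words) & key
```

With this sketch, a sentence query and the corresponding word-set query map to the same goal as long as the remaining words of the sentence are not in Key.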

Step 3) Make Horn clauses from documents: For the abductive inference
in Step 4) of Subsection 3.1, the knowledge-base $\Sigma$ is formed of Horn clauses.
A Horn clause is a clause as in Eq. (1), which means that y becomes true under
the condition that all of $x_1, x_2, \ldots, x_n$ are true, where the variables
$x_1, x_2, \ldots, x_n$ and y are atoms each of which corresponds to an event
occurrence. A Horn clause can describe causes ($x_1, x_2, \ldots, x_n$) and their
effect (y) simply.

$$y \text{ :- } x_1, x_2, \ldots, x_n. \qquad (1)$$

In combination retrieval, the Horn clause for document $D_x$ describes the
cause (reading $D_x$ with enough vocabulary knowledge) and the effect (acquiring
new knowledge from $D_x$) of reading $D_x$, as:

$$\alpha \text{ :- } \beta_1, \beta_2, \ldots, \beta_{m_x}, D_x. \qquad (2)$$

Here, $\alpha$ is the effect term of $D_x$, which is a term (a word or a phrase) one
can understand by reading document $D_x$. $\beta_1, \beta_2, \ldots, \beta_{m_x}$ are the
conditional terms of $D_x$, which should be understood for reading and
understanding $D_x$. That is, one who knows the words $\beta_1, \beta_2, \ldots, \beta_{m_x}$
and reads $D_x$ with this knowledge is supposed to acquire knowledge about $\alpha$.

The method for taking the effect and the conditional terms from $D_x$ is
straightforward. First, the effect terms $\alpha_1, \alpha_2, \ldots$ are obtained as the
terms in $G \cap$ {the keywords of $D_x$}. This means that the effect of $D_x$ is
determined by the user's interest G, rather than by the intention of the author of
$D_x$. For example, a document about cancer symptoms may work as a description
of the demerit of smoking, if the reader is a heavy smoker. Focusing on the user's
goal in this way also speeds up the response of combination retrieval, as in
Subsection 5.1.

Then, the keywords of $D_x$ other than the effect terms above form the
conditional terms $\beta_1, \beta_2, \ldots, \beta_{m_x}$. As a result, Horn clauses are
obtained as

$$\alpha_1 \text{ :- } \beta_1, \beta_2, \ldots, \beta_{m_x}, D_x,$$
$$\alpha_2 \text{ :- } \beta_1, \beta_2, \ldots, \beta_{m_x}, D_x, \qquad (3)$$

meaning that one who knows $\beta_1, \beta_2, \ldots, \beta_{m_x}$ can read $D_x$ and
understand all the effect terms $\alpha_1, \alpha_2, \ldots$ by reading $D_x$.
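
The goal-oriented construction of Eq. (3) can be sketched as follows. This is a minimal illustration, assuming that each document is already reduced to its keyword set; the `HornClause` type and `make_clauses` function are hypothetical names of ours, not part of the original system.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Set


class HornClause(NamedTuple):
    effect: str                  # alpha: a goal term the document explains
    conditions: FrozenSet[str]   # beta_1 .. beta_m: the other keywords of the document
    doc: str                     # D_x: the hypothesis "the user reads D_x"


def make_clauses(goal: Set[str], doc_keywords: Dict[str, Set[str]]) -> List[HornClause]:
    """Build one clause per (document, effect term) pair, as in Eq. (3)."""
    clauses = []
    for doc in sorted(doc_keywords):
        keywords = doc_keywords[doc]
        effects = goal & keywords                 # effect terms: G ∩ keywords(D_x)
        conditions = frozenset(keywords - effects)  # all remaining keywords
        for alpha in sorted(effects):
            clauses.append(HornClause(alpha, conditions, doc))
    return clauses
```

A document sharing no keyword with G yields no clause at all, which is one way to read the remark that focusing on the user's goal speeds up the response.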

Step 4) Cost-based abduction for obtaining the documents to read: We
employ cost-based abduction (CBA, hereafter) [7], an inference framework, for
obtaining the solution h with the least |K| in Subsection 3.1. In CBA, the causes
of a given effect G are explained. Formally, CBA is described as extracting a
minimal hypothesis-set h from a given set H of candidate hypotheses, so that h
derives G using the knowledge $\Sigma$. That is, h satisfies Eq. (4) under Eq. (5) and
Eq. (6). We deal with $\Sigma$ composed of causal rules, expressed as the Horn
clauses mentioned above.



$$\text{Minimize } cost(h), \qquad (4)$$

under the conditions that:

$$h \subset H, \qquad (5)$$

$$h \cup \Sigma \vdash G. \qquad (6)$$

Eq. (4) represents the selection of h to be minimal, i.e., the lowest-cost
hypothesis-set h ($\subset H$), where the cost, denoted by cost(h), is the sum of the
weights of the hypotheses in h. The weights of the hypotheses in H, the candidate
elements of the solution h, are given initially. Generally speaking, the weight
values of hypotheses are closely related to the semantics of the problem to which
CBA is applied, as exemplified in [8]. In combination retrieval, weights are given
differently to the two types of hypotheses in H:

Type 1: Hypothesis that the user reads a document in $C_{doc}$

Type 2: Hypothesis that the user knows (has learned) a conditional term in Key

In giving weights to hypotheses, we considered that the user should be able to
understand the output documents in S by learning only a small set K of keywords
from external knowledge other than $C_{doc}$. This is reflected in minimizing |K|,
the size of K. That is, the weights of hypotheses of Type 2 are fixed to 1 and
those of Type 1 are fixed to 0, and the content of h is $S \cup K$. It might be good
to give values between 0 and 1 to hypotheses of Type 2, each value representing
the difficulty of learning each term. However, we do not know how easy each word
is for the user to learn from outside of $C_{doc}$. Further, it might seem necessary
to give positive weights to hypotheses of Type 1, each value representing the cost
of reading each document. However, this necessity can be discounted because we
set $m_x$ in Eq. (3) to be proportional to the length of $D_x$. That is, the user's
cost (effort) for reading a document is implied by the number of meaningful
keywords s/he should read in the document. If we summed the heterogeneous
difficulties, i.e., of reading documents and of learning words, the meaning of
the solution cost would become rather confusing.

3.3 An Example of Combination Retrieval’s Execution 

For example, combination retrieval runs as follows.

Step 1) $Q_g$ = “Does alcohol cause a liver cancer?”

Step 2) G is obtained from $Q_g$ as {alcohol, liver, cancer}.

Step 3) From $C_{doc}$, documents $D_1$, $D_2$, and $D_3$ are taken, each including
terms in G, and put into Horn clauses as:

alcohol :- cirrhosis, cell, disease, $D_1$.
liver :- cirrhosis, cell, disease, $D_1$.
alcohol :- marijuana, drug, health, $D_2$.
liver :- marijuana, drug, health, $D_2$.
alcohol :- cell, disease, organ, $D_3$.
cancer :- cell, disease, organ, $D_3$.

Hypothesis-set H is formed of the conditional parts $D_1$, $D_2$, and $D_3$ of
Type 1, each weighted 0, and “cirrhosis,” “cell,” “disease,” “marijuana,”
“drug,” “health,” and “organ” of Type 2, each weighted 1.

Step 4) h is obtained as $S \cup K$, where

S = {$D_1$, $D_3$} and

K = {cirrhosis, cell, disease, organ},

meaning that the user should understand “cirrhosis”, “cell”, “disease”, and
“organ” for reading $D_1$ and $D_3$, which are served as the answer to $Q_g$. This
solution is selected because cost(h) (i.e., |K|) takes the value 4, less than the
cost 6 of the only alternative feasible solution, i.e., {marijuana, drug, health,
cell, disease, organ} plus {$D_2$, $D_3$}.

Step 5) The user now reads the two documents presented as:

$D_1$ (including alcohol and liver) stating that alcohol alters liver function
by changing liver cells into cirrhosis.

$D_3$ (including alcohol and cancer) showing the causes of cancer in various
organs, including heavy alcohol consumption. This document recommends
that drinkers limit themselves to one ounce of pure alcohol per day.

As a result, the subject learns that s/he should limit alcohol consumption to
keep the liver healthy and avoid cancer, and also comes to know that tissues
other than the liver get cancer from alcohol.

Thus, the user can understand the answer by learning a small number of words
from outside of $C_{doc}$, as we aimed in employing CBA. Beyond this major effect
of combination retrieval, an even more important by-product is that the
hypotheses common to $D_1$ and $D_3$, i.e., {cell, disease} of Type 2, are
discovered as the context of the user's interest underlying the entered query.
This effect is due to CBA, which obtains the smallest number of involved contexts
for explaining the goal (i.e., answering the query) as solution hypotheses.
Presenting such a novel and meaningful context to the user induces the user to
create new knowledge [9], satisfying his/her novel interest.
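
The example of this subsection can be reproduced with a small exhaustive search. The sketch below is only an illustration of the cost model (documents weigh 0, conditional terms weigh 1); it enumerates all document subsets, which is exponential in the number of documents, whereas the system described here relies on a fast abduction method as in [12]. The data layout and the function name `cheapest_explanation` are assumptions of ours.

```python
from itertools import combinations

# Clauses of the example: document -> (effect terms, conditional terms).
CLAUSES = {
    "D1": ({"alcohol", "liver"}, {"cirrhosis", "cell", "disease"}),
    "D2": ({"alcohol", "liver"}, {"marijuana", "drug", "health"}),
    "D3": ({"alcohol", "cancer"}, {"cell", "disease", "organ"}),
}


def cheapest_explanation(goal, clauses):
    """Return (cost, S, K) with the least |K| such that reading S with knowledge K
    derives every term of the goal, or None if no document subset does."""
    best = None
    docs = sorted(clauses)
    for r in range(1, len(docs) + 1):
        for subset in combinations(docs, r):
            derived = set().union(*(clauses[d][0] for d in subset))
            if not goal <= derived:
                continue  # this document-set does not explain the whole goal
            needed = set().union(*(clauses[d][1] for d in subset))  # K for this S
            if best is None or len(needed) < best[0]:
                best = (len(needed), set(subset), needed)
    return best


cost, S, K = cheapest_explanation({"alcohol", "liver", "cancer"}, CLAUSES)
```

Because "cell" and "disease" are shared by $D_1$ and $D_3$, they are counted once in K, which is how the shared context lowers the cost of this combination.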

4 MACLOD: Map of Complementary Links of 
Documents 

In combination retrieval, two tasks are imposed on the user: reading the
obtained document-set and understanding the conditional terms of that
document-set. However, these tasks are not always easy, since background
knowledge differs between individuals. To take a user's existing knowledge into
consideration when generating the document-set to read, we propose a new
framework that navigates a user by graphically presenting the documents of
several document-sets, each giving an answer to his/her interest. The concept of
knowledge navigation in Section 2 can be realized in this framework.

The implemented system, named MACLOD (MAp of Complementary Links
Of Documents), visualizes these document-sets (each forming a complementary
document-set) to navigate a user to an understanding of his/her novel question.
The process of MACLOD is as follows:

The Process of MACLOD

Phase 1. Obtain a plan for knowledge navigation: Obtain a plan
(document-set S) for the user's query $Q_g$ following the procedure of
combination retrieval in Section 3. That is, the process is summarized as follows:

Step 1) Accept the user's query $Q_g$.

Step 2) Obtain G, the goal the user wants to understand.

Step 3) Make the knowledge-base $\Sigma$ for the abduction of Step 4).

Step 4) Obtain h, the optimal hypothesis-set which derives G if combined
with $\Sigma$, by cost-based abduction.

Step 5) Show the documents obtained in Step 4) to the user.

Phase 2. Iterate Phase 1 to add plans: Iterate Phase 1 to obtain N plans,
where inconsistency conditions are added to the knowledge-base $\Sigma$ of
Subsection 3.2 to avoid plans already obtained. The inconsistency condition
to be considered in each cycle of Phase 1 is described as

$$inc \text{ :- } D_{x_1}, D_{x_2}, \ldots, D_{x_n}, \qquad (7)$$

where $D_{x_1}, D_{x_2}, \ldots, D_{x_n}$ are the documents obtained in the previous
cycle of Phase 1. Here, documents that have been included in S more than three
times are forced not to be included in the next plan. This inconsistency
condition, also added to the knowledge-base $\Sigma$, is described as

$$inc \text{ :- } D_{x_i}, \qquad (8)$$

where $D_{x_i}$ is a document that has been included in S more than three times.
The cycles of Phase 1 continue until the number of iterations reaches N. Here,
we empirically set N to 10.

Phase 3. Visualize the plans: MACLOD outputs a 2-dimensional interface onto
which the plans obtained during the above iterations are mapped. In the
interface, the documents of a plan obtained in one cycle of Phase 1 are
connected to each other by links, to help the user select appropriate documents.

Phase 4. Knowledge Navigation: The user goes on reading documents along
the links in the 2-dimensional interface until s/he understands $Q_g$ or gives up
understanding it.
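
Phases 1 and 2 can be sketched as a loop around a plan search. The following Python code is a simplified illustration, not the implemented system: it replaces cost-based abduction with an exhaustive search, treats Eq. (7) as "no later plan may contain all documents of an earlier plan", and treats Eq. (8) as "a document is dropped once it has appeared in `max_uses` plans". The exact threshold is our reading (the text says "more than three times", while no document in Table 1 of Subsection 5.2 appears more than three times), and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy clauses reused from Subsection 3.3: document -> (effect terms, conditional terms).
EXAMPLE = {
    "D1": ({"alcohol", "liver"}, {"cirrhosis", "cell", "disease"}),
    "D2": ({"alcohol", "liver"}, {"marijuana", "drug", "health"}),
    "D3": ({"alcohol", "cancer"}, {"cell", "disease", "organ"}),
}


def enumerate_plans(goal, clauses, n_plans=10, max_uses=3):
    """Return up to n_plans (cost, document-set) pairs in the order they are found."""
    plans, uses = [], Counter()
    docs = sorted(clauses)
    for _ in range(n_plans):
        allowed = [d for d in docs if uses[d] < max_uses]       # Eq. (8)-style bans
        best = None
        for r in range(1, len(allowed) + 1):
            for subset in combinations(allowed, r):
                chosen = frozenset(subset)
                if any(prev <= chosen for _cost, prev in plans):  # Eq. (7)-style exclusion
                    continue
                derived = set().union(*(clauses[d][0] for d in subset))
                if not goal <= derived:
                    continue
                cost = len(set().union(*(clauses[d][1] for d in subset)))
                if best is None or cost < best[0]:
                    best = (cost, chosen)
        if best is None:
            break  # no further consistent plan exists
        plans.append(best)
        uses.update(best[1])
    return plans


plans = enumerate_plans({"alcohol", "liver", "cancer"}, EXAMPLE)
```

On the toy data only two plans exist, so the loop stops before reaching N; on a real collection it would run until N plans are collected.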

5 Experimental Evaluations of MACLOD 

5.1 The Experimental Conditions 

MACLOD is implemented on a Celeron 500 MHz machine with 320 MB of memory.
Although CBA is time-consuming because of its NP-completeness, most
answers in the experiments were returned within ten seconds of the query being
entered, thanks to high-speed abduction as in [12]. Queries from users included
four or fewer terms in Key, which kept the response time below 10 seconds.
This quick response also comes from the goal-oriented construction of Horn
clauses shown in Subsection 3.2. The document-collection $C_{doc}$ of MACLOD
consists of 1808 question-answer pairs of Alice, a health care question answering
service on the WWW (http://www.alice.columbia.edu). A collection as small as
1808 documents is a suitable condition for evaluating MACLOD on a sparse
document-collection, i.e., one which is insufficient for answering novel queries.

5.2 An Example of MACLOD’s Execution 

When a user entered a query as a word-set or a sentence, MACLOD obtained
the ten plans (document-sets) in Table 1 and showed the 2-dimensional output in
Fig. 1. In this case, {alcohol, fat, calorie} was entered as the query $Q_g$, in
order to know whether the calories of alcohol change into fat.



Table 1. The top 10 plans for the input query {alcohol, fat, calorie}.

Ranking   Plan (document-set)     Cost
1         d1459, d0181            25
2         d1459, d0611            26
3         d1459, d0426            27
4         d1802, d0181            27
5         d0576, d0181            27
6         d1802, d0882, d0611     39
7         d1802, d1100, d0611     39
8         d0746, d0576, d1466     39
9         d1730, d0576, d1466     39
10        d0746, d0331, d1466     41



The process of understanding the user's interest (shown as $Q_g$) begins by
reading the document-set of d1459 and d0181 (double-circle nodes in Fig. 1), the
top-ranked plan of MACLOD. Their summaries are as follows:

d1459 (including fat and calorie) stating that if calories run short, protein
is burned into energy. The lack of protein delays recovery from
distress, or weakens resistance to disease.
d0181 (including alcohol) stating that drinking too much alcohol damages
various tissues, especially the liver or the heart.

After reading these two documents, the user's interest was not fully satisfied,
since the documents do not directly mention the causality between the calories of
alcohol and fat. If the interest is not satisfied, the user begins to select and
read other documents linked from the documents already read, to get new
information about $Q_g$. MACLOD helps this complementary reading process with a
2-dimensional interface in which a user can piece together the overall relations
among the documents of the obtained plans. That is, the user can pick other
documents that complement the documents already read, until s/he reaches
satisfaction.

Fig. 1. A 2-dimensional interface of MACLOD. Documents are shown as nodes, and
complementary documents are connected with links.



The subsequent steps proceed, for example, as follows. In Fig. 1, d0611 and
d0426 are linked from d1459, and d1802 and d0576 are linked from d0181. Here,
because the user wanted to know the maximum amount of alcohol to drink, the user
was satisfied by reading d0611, which states the adequate quantity of alcohol per
day. Also, d0576, stating the ideal quantity of calories per day, satisfied the
user further because his potential interest was in diet. Thus, MACLOD can supply
complementary documents step by step according to the user's interests until
the user is satisfied.
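
The links used in this walk-through can be derived from the plans of Table 1. The sketch below assumes that every pair of documents occurring together in a plan is connected by a link, which is our reading of Phase 3, and that the document identifiers in Table 1 are digit strings such as d1459. The function names are ours.

```python
from collections import defaultdict
from itertools import combinations

# The ten plans of Table 1 (document identifiers as listed there).
PLANS = [
    ["d1459", "d0181"], ["d1459", "d0611"], ["d1459", "d0426"],
    ["d1802", "d0181"], ["d0576", "d0181"],
    ["d1802", "d0882", "d0611"], ["d1802", "d1100", "d0611"],
    ["d0746", "d0576", "d1466"], ["d1730", "d0576", "d1466"],
    ["d0746", "d0331", "d1466"],
]


def build_links(plans):
    """Connect every pair of documents that occur together in some plan."""
    links = defaultdict(set)
    for plan in plans:
        for a, b in combinations(plan, 2):
            links[a].add(b)
            links[b].add(a)
    return links


def unread_neighbours(links, doc, already_read):
    """Documents linked from `doc` that the user has not read yet."""
    return sorted(links[doc] - set(already_read))


links = build_links(PLANS)
read = ["d1459", "d0181"]                       # the top-ranked plan
next_from_d1459 = unread_neighbours(links, "d1459", read)
next_from_d0181 = unread_neighbours(links, "d0181", read)
```

Under these assumptions, the unread neighbours of d1459 and d0181 are exactly the documents named in the text as linked from them.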



5.3 The Answering System Compared with MACLOD 

We compared the performance of MACLOD with the following typical search
engine for question answering. We call this search engine a Vector-based
FAQ-finder (VFAQ for short hereafter).

The Procedure of VFAQ

Step 1') Prepare a keyword-vector $v_x$ for each question in $C_{doc}$.
Step 2') Obtain the keyword-vector $v_Q$ for the current query $Q_g$.
Step 3') Find the top N' keyword-vectors prepared in Step 1'), in decreasing
order of the product value $v_x \cdot v_Q$, and return their corresponding answers.

Here, a keyword-vector for a query Q is formed as follows: Each vector
has |Key| attributes (Key was introduced in Subsection 3.2 as the set of keyword
candidates in $C_{doc}$), each taking the TFIDF value [6], in Q, of the
corresponding keyword. Each vector v is normalized to make |v| = 1. For example,
for the query $Q_g$ = {alcohol, warm} (or a question which is put into G:
{alcohol, warm}), the vector comes to be (0, 0.99, 0, ..., 0, 0.14, 0, 0, ...),
where 0.99 and 0.14 are the normalized TFIDF values of “alcohol” and “warm” in
$Q_g$. Elements of value 0 here correspond to terms which are in Key but not
included in $Q_g$. Supplying N' documents in Step 3') sets conditions similar to
those of MACLOD, so that a fair comparison becomes possible.
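
The VFAQ baseline can be sketched in a few lines of Python. This is an illustration under assumptions: vectors are stored sparsely as dictionaries rather than as |Key|-dimensional arrays, the `idf` table is supplied by the caller, question vectors are recomputed per query instead of being prepared once as in Step 1'), and the function names are ours.

```python
import math
from collections import Counter


def tfidf_vector(words, idf):
    """Normalized TFIDF vector of a word list, restricted to the words in `idf` (Key)."""
    tf = Counter(w for w in words if w in idf)
    vec = {w: tf[w] * idf[w] for w in tf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}


def dot(u, v):
    """Inner product of two sparse vectors."""
    return sum(val * v.get(w, 0.0) for w, val in u.items())


def vfaq(query_words, questions, idf, top_n=3):
    """Return the indices of the top_n questions by decreasing product value."""
    vq = tfidf_vector(query_words, idf)
    scored = [(dot(tfidf_vector(q, idf), vq), i) for i, q in enumerate(questions)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _score, i in scored[:top_n]]


# Hypothetical idf weights, chosen only so that the normalized values round to
# the 0.99 and 0.14 quoted in the text for {alcohol, warm}.
example = tfidf_vector(["alcohol", "warm"], {"alcohol": 7.0, "warm": 1.0})
```

Since both vectors are normalized, the product value is the cosine of the angle between them, so questions sharing the heavily weighted query terms are ranked first.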



5.4 Result Statistics 

The experiment was carried out with 5 subjects from 21 to 30 years old. This
means that the subjects were close in age to the past question askers of Alice.

A popular method for evaluating the performance of a search engine is to
look at recall (the number of relevant documents retrieved, divided by the number
of documents in $C_{doc}$ relevant to the user's query) and precision (the number
of relevant documents retrieved, divided by the number of retrieved documents).
However, this traditional manner of evaluation is not appropriate for evaluating
MACLOD, because it does not output a mere list of the documents most relevant
to the query. In the traditional evaluation, it is regarded as a success if the
user is satisfied by reading a few documents which are highly ranked in the output
list. On the other hand, MACLOD aims at satisfying a user who reads several
documents along the pathways, rather than a few best documents. Therefore,
this section presents an original way of evaluating MACLOD.

Here, 42 queries were entered. This seems to be quite a small number for
evaluation data. However, we accepted this size of data because we aimed at
having each subject evaluate the returned answer in a natural manner. That is,
in order to have the subject report whether s/he was really satisfied with the
output, the subject must enter his/her real rare interest. Otherwise, the
subject has to imagine an unreal person who asks the rare query and imagine
what that person would feel about the returned answers. Therefore, we restricted
ourselves to a small number of queries entered from real novel interests.

The overall result is shown in Fig. 2. The horizontal axis shows the number
of documents read in series, x, and the vertical axis shows the number of
satisfied queries. According to the subjects, MACLOD did better than VFAQ,
especially for novel queries. For x = 1, MACLOD and VFAQ each satisfied 16
queries. On the other hand, for x = 2, MACLOD satisfied 12 queries, whereas
VFAQ satisfied 4 queries. For x = 3, MACLOD satisfied 6 queries, whereas
VFAQ satisfied 3 queries. Finally, for x ≥ 4, MACLOD and VFAQ each satisfied
3 queries. Thus, the superiority of MACLOD for x greater than 1 became
apparent. In all cases, VFAQ obtained redundant documents, i.e., documents of
similar contexts, equally relevant to the query.

Fig. 2. Statistical results. (Horizontal axis: the number of documents read in series.)

These results can be summarized as follows: queries novel to $C_{doc}$ were
answered satisfactorily by MACLOD. According to the subjects, the answers in the
form of document combinations visualized by MACLOD were easy to read and browse
along the links, and the presented answers were meaningful to the user.



5.5 Comparison with Other Methods 

Among the few systems which combine documents for answering a novel query,
Hyper Bridges [10] and NaviPlan [11] produce a plan for the user's reading of
documents. They present a plan made of multiple sorted documents, and a user who
reads them in the order sorted by Hyper Bridges or NaviPlan incrementally
refines his/her own knowledge until s/he learns the meaning of the entered query.
A plan made by these tools is a serial set of documents, which guides the user
to an understanding of the query starting from a beginner's knowledge, in the
order presented by the system. As a result, neither NaviPlan nor Hyper Bridges
can obtain an appropriate document to be read last, i.e., the document which
directly reaches the goal (i.e., answers the query), in all the examples above,
where multiple documents need to be mixed to answer the query. On the other
hand, combination retrieval and its advanced version MACLOD make a
complementary set of documents, each supplementing the content of the others to
give a satisfactory answer as a whole. The user may read the documents in an
obtained document-set in any order s/he likes. In particular, MACLOD gives the
user a more flexible search interface than the original combination retrieval.

Let us here show the merit of MACLOD compared with the previous combination
retrieval. In short, the merit is that the user can select documents matching
his/her interest, reactively reflecting the context of the documents already read.



Knowledge Navigation on Visualizing Complementary Documents



The fair extension of combination retrieval to be compared with MACLOD
is one that outputs as many document-sets as MACLOD obtains. In such an output
style, it will be difficult to control the context of the documents to read.
That is, the order of the sets sorted by cost does not always correspond to the
user's interest, and often forces the user to read the document-sets in an
undesired order. In this example, if the user feels that d1459 does not match
his/her context, s/he will not reach any satisfactory document-set in the list.
Nor does a MACLOD-style output as in Fig. 1 make things better in this case,
because d1459 is shared by all the sets. In all trials of obtaining and showing
the highly ranked document-sets of combination retrieval, the user was fixed to
the context bound by a central document such as d1459, whether or not s/he
desired the situation. From this problem with combination retrieval, we can
point out the two-fold merit of MACLOD.

1. By discarding documents that have already appeared many times in the
output document-sets in the process (see Section 4), MACLOD can include
document-sets of various contexts in the output. This enables the user to
choose suitable contexts reactively in the search process.

2. The graphical output makes context control easier, because the links
between nodes (documents) represent the complementary relations (i.e., as
documents to be read together) between contexts. If the user feels a document
is misleading, s/he can open a document linked from the current document
without feeling a sudden departure from the current context.

6 Conclusions 

Combination retrieval, a method to obtain a set of documents for answering
a novel query, is described in full, and its visual interface MACLOD is
introduced. Combination retrieval presents the user with a set of documents, not
a single document, for answering a new query that cannot be answered by one past
answer to a past query. The MACLOD interface gives the user further comfort in
acquiring novel knowledge. MACLOD allows the user to efficiently alter a part of
the reading plan (i.e., document-set) s/he is currently following, improving
his/her satisfaction. This effect works especially well if the interest is novel,
i.e., if the context is too particular to be captured by past Q&As.

References 

1. Hadamard, J.: The Psychology of Invention in the Mathematical Field. Princeton
University Press, 1945.

2. Swanson, D.R., Smalheiser, N.R.: An Interactive System for Complementary
Literatures: a Stimulus to Scientific Discovery. Artificial Intelligence, Vol. 91,
183-203, 1997.

3. Matsumura, N. and Ohsawa, Y.: Combination Retrieval for Creating Knowledge
from Sparse Document Collection. Proc. of Discovery Science, 320-324, 2000.




270 



N. Matsumura, Y. Ohsawa, and M. Ishizuka 



4. Brookes, B. C.: The foundations of information science, Journal of Information 
Science, 2, 125-133, 1980. 

5. Porter, M.F.: An Algorithm for Suffix stripping. Automated Library and Informa- 
tion Systems, Vol.l4, No. 3, 130-137, 1980. 

6. Salton, G. and Buckey, C.: Term- Weighting Approach in Automatic Text Retrieval, 
Reading in Information Retrieval, 323-328, 1998. 

7. E. Charniak and S.E. Shimony: Probabilistic Semantics for Cost Based Abduction. 
Proe. of AAAI-90, 106-111, 1990. 

8. Ohsawa, Y. and Yachida, M.: An Index Navigator for Understanding and Express- 
ing User’s Coherent Interest, Proc. of IJCAI-97, 1: 722-729, 1997. 

9. Nonaka,!. and Takeuchi, H.: The Knowledge Creating Company, Oxford University 
Press, 1995. 

10. Ohsawa, Y., Matsuda, K. and Yachida, M.: Personal and Temporary Hyper 
Bridges: 2-D Interface for Undefined Topics, J. Computer Networks and ISDN 
Systems, 30: 669-671, 1998. 

11. Yamada, S. and Osawa, Y.: Planning to Guide Concept Understanding in the 
WWW. AAAI-98 Workshop on AI and Data Integration, 121-126, 1998. 

12. Ohsawa, Y. and Ishizuka, M.: Networked Bubble Propagation: A Polynomial-time 
Hypothetical Reasoning Method for Computing Near-optimal Solutions, Artificial 
Intelligence, Vol.91, 131-154, 1997. 




KeyWorld: Extracting Keywords from 
a Document as a Small World 



Yutaka Matsuo^1,2, Yukio Ohsawa^2,3, and Mitsuru Ishizuka^1

^1 University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-8656, JAPAN,
matsuo@miv.t.u-tokyo.ac.jp,
http://www.miv.t.u-tokyo.ac.jp/~matsuo/

^2 Japan Science and Technology Corporation, Tsutsujigaoka 2-2-11, Miyagino-ku,
Sendai, Miyagi, 983-0852, JAPAN

^3 University of Tsukuba, Otsuka 3-29-1, Bunkyo-ku, Tokyo 113-0012, JAPAN



Abstract. The small world topology is known to be widespread in biological,
social and man-made systems. This paper shows that the small world
structure also exists in documents, such as papers. A document is
represented by a network; the nodes represent terms, and the edges represent
the co-occurrence of terms. This network is shown to have the
characteristics of a small world, i.e., nodes are highly clustered yet the
path length between them is small. Based on this topology, we develop
an indexing system called KeyWorld, which extracts important terms by
measuring their contribution to the graph being a small world.

1 Introduction 

Graphs that occur in many biological, social and man-made systems are often
neither completely regular nor completely random, but have instead a “small
world” topology in which nodes are highly clustered yet the path length between
them is small [11] [10]. For instance, if you are introduced to someone at a party
in a small world, you can usually find a short chain of mutual acquaintances
that connects the two of you. In the 1960s, Stanley Milgram’s pioneering work on
the small world problem showed that any two randomly chosen individuals in
the United States are linked by a chain of six or fewer first-name acquaintances,
known as “six degrees of separation” [5]. Watts and Strogatz have shown that
a social graph (the collaboration graph of actors in feature films), a biological
graph (the neural network of the nematode worm C. elegans), and a man-made
graph (the electrical power grid of the western United States) all have a small
world topology [11] [10]. The World Wide Web also forms a small world network
[1].

In the context of document indexing, an innovative algorithm called KeyGraph
[6] has been developed, which utilizes the structure of the document. A document
is represented as a graph in which each node corresponds to a term^1 and each
edge corresponds to the co-occurrence of terms. Based on the segmentation of
this graph into clusters, KeyGraph finds keywords by selecting the terms which
co-occur in multiple clusters. Recently, KeyGraph has been applied to several
domains, from earthquake sequences [7] to register transaction data of retail
stores, and has shown remarkable versatility.

^1 A term is a word or a word sequence.

In this paper, inspired by both the small world phenomenon and KeyGraph, we
develop a new algorithm, called KeyWorld, to find important terms. We show
first that the graph derived from a document has small world characteristics.
To extract important terms, we find those terms which contribute to the world
being small. The contribution is quantitatively measured by the difference in
“small-worldliness” with and without the term.

The rest of the paper is organized as follows. In the following section, we
first detail the small world topology, and show that some documents actually
have small world characteristics. Then we explain how to extract the important
terms in Section 3. We evaluate KeyWorld and suggest further improvements in
Section 4. Finally, we discuss future work and conclude this paper.

2 Term Co-occurrence Graph and Small World 

2.1 Small-Worldliness

We treat an undirected, unweighted, simple, sparse and connected graph. (We
extend this to unconnected graphs in Section 3.) To formalize the notion of a
small world, Watts and Strogatz define the clustering coefficient and the
characteristic path length [11] [10]:

- The characteristic path length, L, is the path length averaged over all pairs
of nodes. The path length d(i, j) is the number of edges in the shortest path
between nodes i and j.

- The clustering coefficient is a measure of the cliqueness of the local
neighborhoods. For a node with k neighbors, at most $\binom{k}{2} = k(k-1)/2$
edges can exist between them. The clustering of a node is the fraction of
these allowable edges that occur. The clustering coefficient, C, is the average
clustering over all the nodes in the graph.
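
The two definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the graph is assumed to be given as a dictionary mapping each node to the set of its neighbors, the function names are hypothetical, and a node with fewer than two neighbors is assumed to have clustering 0 (the text does not specify this case).

```python
from collections import deque
from itertools import combinations


def shortest_path_lengths(adj, source):
    """Breadth-first search: number of edges from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def characteristic_path_length(adj):
    """Average d(i, j) over all unordered node pairs of a connected graph."""
    nodes = list(adj)
    total = 0
    for idx, u in enumerate(nodes):
        dist = shortest_path_lengths(adj, u)
        total += sum(dist[v] for v in nodes[idx + 1:])
    n = len(nodes)
    return total / (n * (n - 1) / 2)


def clustering_coefficient(adj):
    """Average, over all nodes, of the fraction of neighbor pairs that are linked."""
    per_node = []
    for nbs in adj.values():
        k = len(nbs)
        if k < 2:
            per_node.append(0.0)  # assumption: undefined case counted as 0
            continue
        links = sum(1 for a, b in combinations(nbs, 2) if b in adj[a])
        per_node.append(links / (k * (k - 1) / 2))
    return sum(per_node) / len(per_node)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(characteristic_path_length(toy), clustering_coefficient(toy))
```

For larger graphs one would normally rely on a graph library; the plain breadth-first search is used here only to keep the definitions visible.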

Watts and Strogatz define a small world graph as one in which $L \gtrsim L_{rand}$
(or $L \approx L_{rand}$) and $C \gg C_{rand}$, where L_rand and C_rand are the
characteristic path length and clustering coefficient of a random graph with the
same number of nodes and edges. They propose several models of graphs, one of
which is called β-Graphs. Starting from a regular graph, they introduce disorder
into the graph by randomly rewiring each edge with probability p, as shown in
Fig. 1. If p = 0 then the graph is completely regular and ordered. If p = 1 then
the graph is completely random and disordered. Intermediate values of p give
graphs that are neither completely regular nor completely disordered. They are
small worlds.
Walsh defines the proximity ratio

$$\mu = \frac{C / L}{C_{rand} / L_{rand}} \qquad (1)$$






[Figure: three ring lattices labeled Regular (p = 0), Small world, and Random
(p = 1), ordered by increasing randomness.]

Fig. 1. Random rewiring of a regular ring lattice.



Table 1. Characteristic path lengths L, clustering coefficients C and proximity ratios
μ for graphs with a small world topology [9] (studied in [11]).

              L       L_rand   C       C_rand    μ
Film actor    3.65    2.99     0.79    0.00027   2396
Power grid    18.7    12.4     0.080   0.005     10.61
C. elegans    2.65    2.25     0.28    0.05      4.755

The graphs are defined as follows. For the film actors, two actors are joined by an edge
if they have acted in a film together. For the power grid, nodes represent generators,
transformers and substations, and edges represent high-voltage transmission lines
between them. For C. elegans, an edge joins two neurons if they are connected by either
a synapse or a gap junction. Because the number of nodes and edges for each graph is
different, the magnitude of L, C and μ differs.



as the small-worldliness of the graph [9]. As p increases from 0, L drops sharply
since a few long-range edges introduce short cuts into the graph. These short cuts
have little effect on C. As a consequence the proximity ratio μ rises rapidly and
the graph develops a small world topology. As p approaches 1, the neighborhood
clustering starts to break down, and the short cuts no longer have a dramatic
effect at linking up nodes. C and μ therefore drop, and the graph loses its small
world topology. In Table 1, we can see that μ is large in the graphs with a small
world topology.

In short, small world networks are characterized by the distinctive combina- 
tion of high clustering with short characteristic path length. 
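
As a small worked check, Walsh's proximity ratio, (C / L) divided by (C_rand / L_rand), can be recomputed from the rounded values listed in Table 1. The function name below is hypothetical, and because the table entries are rounded, the recomputed ratios only approximately match the tabulated ones.

```python
def proximity_ratio(L, C, L_rand, C_rand):
    """Proximity ratio: (C / L) relative to the same quantity for a random graph."""
    return (C / L) / (C_rand / L_rand)


# Rounded values taken from Table 1.
film_actor = proximity_ratio(L=3.65, C=0.79, L_rand=2.99, C_rand=0.00027)
power_grid = proximity_ratio(L=18.7, C=0.080, L_rand=12.4, C_rand=0.005)

print(f"film actor: {film_actor:.0f}")
print(f"power grid: {power_grid:.2f}")
```

A ratio far above 1 indicates that the graph is much more clustered than a random graph of the same size while its path length stays comparable.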

2.2 Term Co-occurrence Graph 

A graph is constructed from a document as follows. We first preprocess the
document by stemming and removing stop words, as in [8]. We apply an n-gram
to count phrase frequency. Then we regard the title of the document, each section
title and each caption of figures and tables as a sentence, and exclude all the
figures, tables, and references. We get a list of sentences, each of which consists
of words (or phrases). In other words, we get basket data where each item is a
term, discarding the information of term ordering and document structure.

Table 2. Statistical data on proximity ratios μ for 57 graphs of papers in WWW9.

        L      L_rand   C      C_rand   μ
Max.    4.99   3.58     0.38   0.012    22.81
Ave.    5.36   -        0.33   -        15.31
Min.    8.13   2.94     0.31   0.027    4.20

We set f_thre = 3. We restrict attention to the giant connected component of the graph,
which includes 89% of the nodes on average. We exclude three papers in which the giant
connected component covers less than 50% of the nodes. We do not show L_rand and
C_rand for the average case, because n and k differ depending on the target paper. On
average, n = 275 and k = 5.04.



Then we pick up frequent terms, i.e., those which appear over a user-given
threshold of f_thre times, and fix them as nodes. For every pair of terms, we count
the co-occurrence for every sentence, and add an edge if the Jaccard coefficient
exceeds a threshold, J_thre.^2 The Jaccard coefficient is simply the number of
sentences that contain both terms divided by the number of sentences that contain
either term. This idea is also used in constructing a referral network from WWW
pages [4]. We assume the length of each edge is 1.
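
The graph construction just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: sentences are assumed to be already preprocessed into sets of terms, term frequency is counted as the number of sentences containing the term, and "over the threshold" is read as "at least f_thre". The function and parameter names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(sentences, f_thre, j_thre):
    """Build an undirected term graph from basket data (a list of sets of terms)."""
    # Assumption: frequency = number of sentences in which the term occurs.
    freq = Counter(term for sentence in sentences for term in sentence)
    # Assumption: "over the threshold" is read as >=; use > for a strict reading.
    nodes = [term for term, count in freq.items() if count >= f_thre]
    adj = {term: set() for term in nodes}
    for a, b in combinations(nodes, 2):
        both = sum(1 for s in sentences if a in s and b in s)
        either = sum(1 for s in sentences if a in s or b in s)
        # Jaccard coefficient: sentences with both terms / sentences with either.
        if either and both / either > j_thre:
            adj[a].add(b)
            adj[b].add(a)
    return adj


sentences = [
    {"small", "world", "graph"},
    {"small", "world"},
    {"graph", "node"},
    {"node", "edge"},
    {"small", "graph"},
]
graph = build_cooccurrence_graph(sentences, f_thre=2, j_thre=0.4)
print(graph)
```

In this toy input, "edge" occurs only once and is dropped as a node, and only term pairs whose Jaccard coefficient exceeds 0.4 are joined.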

Table 2 shows statistics of the small-worldliness of 57 graphs, each
constructed from a technical paper that appeared at the 9th international World
Wide Web conference (WWW9) [12]. From this result, we can conjecture that these
papers have small world structures. However, the small-worldliness varies
depending on the paper.

One possible reason why a paper has a small world structure is that the
author introduces some concepts step by step (forming clusters of related
terms), and then tries to merge the concepts and build up new ideas (making
‘shortcuts’ between clusters). The author will keep in mind that the new
idea is steadily connected to the fundamental concepts, but not redundantly.
However, as we have seen, the small-worldliness varies from paper to paper.
It certainly depends on the subject and the aim of the paper, and on the
author’s writing style.



3 Finding Important Terms 

3.1 Shortcut and Contractor 

Admitting that a document is a small world, how does it benefit us? We try
here to estimate the importance of a term, and to pick up important terms, even
if they are rare in the document, based on the small world structure. We consider
‘important terms’ to be the terms which reflect the main topic, the author’s idea,
and the fundamental concepts of the document.

^2 In this paper, we set J_thre so that the number of neighbors, k, is around 4.5 on
average.

First we introduce the notions of a shortcut and a contractor, following the
definitions in [10].

Definition 1. The range R(i, j) of an edge (i, j) is the length of the shortest
path between i and j in the absence of that edge. If R(i, j) > 2, then the edge
(i, j) is called a shortcut.

Applying the notion of “shortcuts” to nodes, we get the definition of a
“contractor.”

Definition 2. If two nodes u and w are both elements of the same neighborhood
Γ(v), and the shortest path length between them that does not involve any edges
adjacent to v, denoted d_v(u, w), is greater than 2, then v is said to contract u
and w, and v is called a contractor.
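
Definitions 1 and 2 can be illustrated with a short Python sketch. It assumes the graph is a dictionary from nodes to neighbor sets, and it treats an unreachable pair as having infinite range, which the definitions do not address explicitly. All function names are hypothetical.

```python
from collections import deque


def bfs_distance(adj, source, target, banned_edge=None, banned_node=None):
    """Shortest path length from source to target, avoiding one edge or one node.

    Returns None if the target cannot be reached.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nb in adj[node]:
            if nb == banned_node or {node, nb} == banned_edge:
                continue
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return None


def edge_range(adj, i, j):
    """Range R(i, j): shortest path between i and j when edge (i, j) is removed."""
    return bfs_distance(adj, i, j, banned_edge={i, j})


def is_shortcut(adj, i, j):
    r = edge_range(adj, i, j)
    return r is None or r > 2  # assumption: unreachable counts as infinite range


def is_contractor(adj, v):
    """True if v contracts some pair of its neighbors (d_v(u, w) > 2)."""
    nbs = list(adj[v])
    for idx, u in enumerate(nbs):
        for w in nbs[idx + 1:]:
            # Banning node v removes every edge adjacent to v from the search.
            d = bfs_distance(adj, u, w, banned_node=v)
            if d is None or d > 2:
                return True
    return False


ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(edge_range(ring, 0, 1), is_contractor(ring, 0), is_contractor(triangle, 0))
```

In the six-node ring every edge is a shortcut and every node a contractor, whereas in the triangle neither notion applies because the neighbors of each node are directly linked.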

Our first thought was that if d_v(u, w) is large, the term corresponding to
contractor v might be interesting, because it bridges distant notions which rarely
appear together. However, such a node sometimes connects nodes far from
the center of the graph, i.e., the main topic of the document. Below we take into
account the whole structure of the graph, calculating the contribution of a node
to making the world small.

To treat a disconnected graph, we extend the definition of path length
(though Watts restricts attention to the giant connected component^3 of the
graph).

Definition 3. An extended path length d'(i, j) of nodes i and j is defined as
follows.

$$d'(i, j) = \begin{cases} d(i, j), & \text{if } (i, j) \text{ are connected,} \\ w_{sum}, & \text{otherwise,} \end{cases} \qquad (2)$$

where w_sum is a constant, the sum of the widths of all the disconnected
subgraphs, and d(i, j) is the length of the shortest path between i and j in a
connected graph.

If some edges are added to the graph and some parts of the graph get
connected, d'(i, j) will not increase, unless the length of an edge is negative.
Thus d'(i, j) is an upper bound of the path length, considering that edges
may be added.
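
A minimal sketch of Definition 3 follows. The text does not define the "width" of a subgraph; the sketch assumes it is the diameter of each connected component (its longest shortest path), which is one plausible reading rather than a confirmed one. The function names are hypothetical.

```python
from collections import deque


def bfs(adj, source):
    """Distances (in edges) from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def extended_path_lengths(adj):
    """Return (d_prime, w_sum), where d_prime[(i, j)] is the extended path length."""
    all_dist = {u: bfs(adj, u) for u in adj}
    # Assumption: the width of a component is its diameter.
    seen, w_sum = set(), 0
    for u in adj:
        if u in seen:
            continue
        component = set(all_dist[u])
        seen |= component
        w_sum += max(all_dist[a][b] for a in component for b in component)
    nodes = list(adj)
    d_prime = {}
    for idx, i in enumerate(nodes):
        for j in nodes[idx + 1:]:
            # Connected pairs keep d(i, j); disconnected pairs get w_sum.
            d_prime[(i, j)] = all_dist[i].get(j, w_sum)
    return d_prime, w_sum


# Two components: a path a-b-c (diameter 2) and an edge d-e (diameter 1).
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": {"e"}, "e": {"d"}}
d_prime, w_sum = extended_path_lengths(adj)
print(w_sum, d_prime[("a", "c")], d_prime[("a", "d")])
```

Under this reading, w_sum is never smaller than any finite path length in the graph, so a disconnected pair is always at least as far apart as any connected pair.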

Definition 4. An extended characteristic path length L' is an extended path
length averaged over all pairs of nodes.

Definition 5. L'_v is an extended path length averaged over all pairs of nodes
except node v. L'_{G_v} is the extended characteristic path length of the graph
without node v.

^3 A connected component of a graph is a set of nodes such that each node pair has a
path. A connected component is called a giant connected component if it contains
more than 50% of the nodes in the graph.







Table 3. Frequent terms in this paper.

Term          Frequency
term          39
small         36
world         35
graph         33
small world   27
node          26
document      25
length        20
important     19
paper         18



Table 4. Terms with 10 largest CB_v in this paper.

Term             CB_v   Frequency
small world      4.38   27
contribution     3.11   11
node             2.98   26
list             2.24   8
author           1.36   7
table            1.10   8
important term   0.80   11
show             0.72   6
structure        0.44   7
KeyWorld         0.44   10



In other words, L'_v is the characteristic path length regarding the node v as a
corridor (i.e., a set of edges). For example, if v neighbors u, w, and z, then
(u, w), (u, z), and (w, z) are considered to be linked. And L'_{G_v} is the extended
characteristic path length assuming the corridor does not exist.

Definition 6. The contribution, CB_v, of the node v to making the world small is
defined as follows.

$$CB_v = L'_{G_v} - L'_v \qquad (3)$$

We do not pay attention to the clustering coefficient, because adding or
eliminating one node affects the clustering coefficient little.

If a node v with large CB_v is absent from the graph, the graph gets very large. In
the context of documents, the topics become divided. We assume that such a term
helps merge the structure of the document, and is thus important.

Table 5. Pairs of terms with 10 largest CB_e.

Pair                            CB_e
node - contribution             2.97
list - table                    1.47
contribution - important term   1.20
table - show                    1.10
contribution - structure        0.87
KeyWorld - list                 0.87
important term - develop        0.79
network - show                  0.72
contribution - make             0.47
author - idea                   0.47



3.2 Example 

We show an example obtained by applying the method to this paper, i.e., the one
you are reading now.^4 Table 3 shows the frequent terms and Table 4 shows the
important terms measured by CB_v. Comparing the two tables, the list of important
terms includes the author’s idea, e.g., “important term” and “KeyGraph,” as well
as important basic concepts, e.g., “structure,” although they do not appear
frequently. In contrast, the list of frequent terms simply shows the components
of the paper, and is not of interest.

We can also measure the contribution of an edge, CB_e, to making the world
small, defined similarly to CB_v. However, if we look at the pairs of terms in Table
5, it is hard to understand what they suggest. There are a number of possible
relations between two terms, so we cannot imagine the relation of the pairs right
away.

Lastly, Fig. 2 shows the graphical visualization of the world of this paper.
(Only the giant connected component of the graph is shown, though other parts
of the graph are also used for calculation.) We can easily point out the terms
without which the world will be separated, say “small world” and “contribution”.

4 Evaluation and Improvements 

This section describes an evaluation of KeyWorld as an indexing system.
KeyWorld is not merely an indexing system; it also provides an understandable
graphical representation of the document. However, we restrict attention here to
the performance of KeyWorld as an indexing tool, to compare it with existing
indexing techniques such as tf and tfidf. The tf measure is simply the term
frequency. The tfidf measure is obtained by using the product of the term
frequency and the inverse document frequency [8].^5

^4 We ignore the effect of self-reference; it is sufficiently small.

^5 We use log(N/n_v) as idf, where N is the number of documents in the collection,
and n_v is the number of documents which include term v.

When an author writes a paper, he/she annotates the paper with keywords
by selecting the category of the paper (e.g. “text mining”), the algorithms utilized
(e.g. “small world”), or the proposed method (e.g. “KeyWorld”). The choice
depends on the author’s criteria. In our definition, a keyword is an important
term in the document, which reflects the main topic, the author’s idea, and the
fundamental concepts of the document. For example, considering this paper,
we think “small world,” “document,” “contribution,” “important term,” “path
length,” and “KeyWorld” are keywords, and “node,” “make,” and “text mining”
are not keywords because they are too trivial or too broad, or do not occur in
this document.

In the experiment, we asked the authors of 20 technical papers in the
artificial intelligence field to judge, by a questionnaire, whether some terms in
their papers are keywords or not. For each document, we first get the top 15
weighted terms by tf, tfidf,^6 KeyGraph, and KeyWorld, i.e., four lists of 15 terms.
(We denote the list by method a as list_a.) We merge the four lists and shuffle the
terms. Then we ask the author whether each term is a keyword or not after
explaining the definition of keywords. Counting the number of authorized terms,
we get the precision of method a as follows.

$$precision_a = \frac{\text{Number of authorized terms in } list_a}{\text{Number of terms in } list_a} \qquad (4)$$



^6 As a corpus, we used 166 papers in Journal of Artificial Intelligence Research, from
Vol. 1 in 1993 to Vol. 14 in 2001.

Table 6. Precision and coverage

            tf     KeyWorld   tfidf   KeyWorld+idf
precision   0.53   0.49       0.55    0.71
coverage    0.48   0.50       0.62    0.68



Table 7. Terms with 10 largest CB_v × idf_v in this paper.

Term             CB_v × idf_v   Frequency
small world      4.57           27
important term   3.82           11
co-occurrence    1.89           4
KeyWorld         1.58           10
short cut        1.56           4
actor            0.89           5
shortest path    0.66           4
sentence         0.66           4
document         0.66           23
path length      0.59           17



Next, from the shuffled list of all terms, the authors are told to pick 5
(or more) terms as indispensable terms, which they think are essential to the
document and cover the most important concepts of the paper. We calculate
the coverage of method a as follows.

$$coverage_a = \frac{\text{Number of indispensable terms in } list_a}{\text{Number of indispensable terms}} \qquad (5)$$
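
The two evaluation measures can be written down directly. In the sketch below the function names are hypothetical, and the example term judgments are invented purely for illustration; they are not data from the experiment.

```python
def precision(method_list, authorized):
    """Fraction of the terms in a method's list that the author accepted as keywords."""
    return sum(1 for term in method_list if term in authorized) / len(method_list)


def coverage(method_list, indispensable):
    """Fraction of the author's indispensable terms that appear in the method's list."""
    return sum(1 for term in indispensable if term in method_list) / len(indispensable)


# Hypothetical judgments for one document (illustration only).
list_a = ["small world", "node", "table", "KeyWorld"]
authorized = {"small world", "KeyWorld", "path length"}
indispensable = {"small world", "path length"}

print(precision(list_a, authorized), coverage(list_a, indispensable))
```

Precision is normalized by the length of the method's list, while coverage is normalized by the number of indispensable terms, so a method can score well on one and poorly on the other.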



The results are shown in Table 6. The performance of KeyWorld is not good
enough. The precision and coverage are almost equal to those of tf. However, we
feel that the term list by KeyWorld includes very important terms as well as very
dull words, e.g. “show” or “table” in Table 4. To sieve out these dull terms, we
develop an improved weighting method, which annotates term v with the weight

$$CB_v \times idf_v, \qquad (6)$$

where idf_v is an idf measure for term v. The improved results are also shown in
Table 6. Both the precision and coverage are now far better than those of tfidf.
Table 7 shows the top 10 terms by KeyWorld with the idf factor for this paper.

In summary, KeyWorld can often find important terms; however, it also
detects less important terms. By incorporating the idf measure, KeyWorld can
be a very good indexing tool.
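
A minimal sketch of the combined weighting follows, using idf = log(N / n_v) as stated in the footnote. The CB_v values are taken from Table 4, but the four-document corpus is hypothetical, the function names are invented, and the handling of a term that occurs in no corpus document (treated as if it occurred in one) is an assumption.

```python
import math


def idf(term, corpus):
    """Inverse document frequency log(N / n_v) over a corpus given as sets of terms."""
    n_v = sum(1 for doc in corpus if term in doc)
    # Assumption: a term absent from the corpus is treated as occurring once.
    return math.log(len(corpus) / max(n_v, 1))


def rank_by_cb_idf(cb, corpus, top=10):
    """Rank terms by CB_v * idf_v, largest weight first."""
    weights = {term: value * idf(term, corpus) for term, value in cb.items()}
    return sorted(weights, key=weights.get, reverse=True)[:top]


# CB_v values from Table 4; the corpus below is hypothetical.
cb = {"small world": 4.38, "show": 0.72, "table": 1.10}
corpus = [
    {"small world", "show", "table"},
    {"show", "table"},
    {"show"},
    {"show"},
]
print(rank_by_cb_idf(cb, corpus))
```

A word such as "show" that occurs in every corpus document gets idf 0 and drops to the bottom of the ranking, which is the sieving effect described above.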

^7 If the author remembers other terms, he/she is permitted to add them to the
list.

5 Discussion

The small world phenomenon was inaugurated as an area of experimental study
in the social sciences by Stanley Milgram in the 1960s. Since then, numerous
studies have been done on network analysis. The importance of weak ties, which
are short cuts between clusters of people, was mentioned 30 years ago [3].

The measure of contribution is similar to “centrality” in the context of social
network study. Centrality can be measured in a number of ways [2]. Considering
an actor’s social network, the simplest is to count the number of others with
whom an actor maintains relations. The actor with the most connections, i.e.,
the highest degree, is most central. Another measure is closeness, which calculates
the distance from each person in the network to each other person based on the
connections among all members of the network. Central actors are closer to all
others than are other actors. A third measure is betweenness, which examines
the extent to which an actor is situated between others in the network, i.e.,
the extent to which information must pass through them to get to others, and
thus the extent to which they will be exposed to information circulating in the
network. Our measure of contribution differs in that it calculates the difference
in the closeness of all nodes with and without a certain node. It measures a
node’s contribution to the whole structure by temporarily eliminating the node.
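
For comparison, the first two centrality measures mentioned above can be sketched as follows. The normalization of closeness as (n - 1) divided by the sum of distances is one common convention and is an assumption here, as are the function names; the graph is assumed to be connected.

```python
from collections import deque


def degree_centrality(adj):
    """Number of direct connections of each node."""
    return {node: len(nbs) for node, nbs in adj.items()}


def closeness_centrality(adj):
    """(n - 1) / (sum of distances to all other nodes), for a connected graph."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        result[source] = (len(adj) - 1) / sum(dist.values())
    return result


# Star graph: a center linked to three leaves.
star = {"c": {1, 2, 3}, 1: {"c"}, 2: {"c"}, 3: {"c"}}
print(degree_centrality(star)["c"], closeness_centrality(star)["c"])
```

Both measures score a node from its own position, whereas the contribution measure compares the whole graph with and without the node.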

6 Conclusion 

Watts mentions in [10] the possible applications of small world research, includ- 
ing “the train of thought followed in a conversation or succession of ideas leading 
to a scientific breakthrough.” In this paper, we have focused on technical papers 
rather than a conversation or succession of ideas. The future direction of our 
research is to treat directed or weighted graphs for finer analyses of documents. 

We expect our approach to be effective not only for document indexing, but also
for other graphical representations. Finding structurally important parts may
bring us a deeper understanding of the graph, new perspectives, and chances to
utilize it. We are interested in a big structural change caused by a small change
in the graph. A change which makes the world very small may sometimes be
very important.

Abstract. Recommendation of representative Web pages on a specific topic is
important for assisting users’ information retrieval from the Web. This paper
describes a method for discovering such pages by purifying Web communities
using connectivity information of hyperlinks. A complete bipartite subgraph of
the Web graph, which is composed of centers (containing useful information
regarding a topic) and fans (containing hyperlinks to centers), can be regarded as
a Web community sharing a common interest. The method is based on the
assumption that most of the fans contain hyperlinks pointing to representative
pages regarding the topic. In the method, both fans and centers are renewed
iteratively by the result of the majority vote of the members of the previous
community. Experimental results show that our method is able to find
representative pages for some topics from only a few input URLs.



1 Introduction 

The number of Web pages in the world surpassed 2 billion as of July 2000.
In order to retrieve useful information from such a huge Web network, methods for
discovering related Web pages are necessary. Although keyword-based search engines
are very popular now, they often run into difficulty because of the synonymy and
polysemy of natural languages. Several studies of Web mining based on hyperlink
information, called Web structure mining, have been attempted, since such methods
can process huge numbers of Web pages compared with content-based Web mining
approaches.

As Broder pointed out [2], there are the following reasons and goals for the re- 
search of Web structure mining: 

1. Designing crawl strategies on the Web

2. Understanding of the sociology of content creation on the Web 

3. Analyzing the behavior of Web algorithms that make use of link information 

4. Predicting the evolution of Web structures such as bipartite cores and webrings, and 
developing better algorithms for discovering and organizing them 

5. Predicting the emergence of important new phenomena in the Web graph 

The author has been working on Web structure mining in order to achieve some
of the above goals. A Web visualization system [10] shows the relations among
input Web pages in the form of a graph in which related pages are located close to
each other. Another attempt is a Web community discovery system [11] that finds
related Web pages based on the assumption that pages composing a complete bipartite
graph can be regarded as a community sharing a common interest.

This paper focuses on a topic shared by a Web community, and proposes a method
for purifying Web communities in order to find representative Web pages regarding
the topic of input pages. It often happens that Web surfers who already know some
Web pages about a specific topic want to find more representative pages about the
same topic. Finding representative Web pages is important for assisting users’
information retrieval from the Web.

The method proposed in this paper is based on the graph structure of hyperlinks,
and it is an extension of the Web community discovery method proposed by the author
[11]. In this method, the set of Web pages which compose a complete bipartite graph
is renewed iteratively by the majority vote of previous members. The procedure is
based on the assumption that most of the Web pages that contain hyperlinks pointing
to the pages of some specific topic contain hyperlinks to representative pages about
the topic. It is expected that such representative pages will be acquired by the
iterative majority vote of the members of previous Web communities. The author has
developed a system based on this purification method. The system succeeds in finding
representative pages for some of the Web communities from hyperlink information
alone. Sometimes, the system outputs unexpected Web pages that differ from the topic
of the input Web pages. Such results are shown and analyzed in the section on
experiments.

2 Related Work 

Well-known examples of Web structure mining, which focuses on the graph structure
of hyperlinks, are HITS [7], Web Trawling [9], and PageRank [12]. HITS is
an algorithm for topic-dependent ranking. In this algorithm, authority and hub are
employed as the criteria for evaluating the usefulness of each Web page. A hub page
on a topic is a page that has hyperlinks to many other pages on that topic, in other
words, a page that links to many authorities on the topic. A good authority is a page
that is pointed to by many good hubs, while a good hub is a page that points to many
authorities. For each Web page, authority weights and hub weights are calculated as
follows:

1. Sampling step: A focused collection of several thousand Web pages likely to be
rich in relevant authorities is generated. First, the HITS algorithm constructs a
subgraph expected to be rich in relevant authoritative pages, in which it will
search for hubs and authorities. To construct this subgraph, the algorithm uses
keyword queries to collect a root set of about 200 pages from a traditional
index-based search engine. Since many of these pages are presumably relevant to
the search topic, some of the pages are expected to contain links to prominent
authorities, and others to be linked to by prominent hubs. The root set is
therefore expanded into a base set by including all pages that are linked to by
pages in the root set, and all pages that link to a page in the root set. Our
attention is restricted to this base set for the remainder of this algorithm. This
base set typically contains roughly 1000-3000 pages, and a large number of
authoritative pages for the search topic are expected to be in this set.

2. Modification step: Hyperlinks between two pages on the same Web site very often
serve a purely navigational function, and typically do not represent conferral of
authority. All such hyperlinks are deleted from the subgraph induced by the base
set, and the remainder of the algorithm is applied to this modified subgraph.

3. Weight-propagation step: The algorithm associates a non-negative authority weight
x_p and a non-negative hub weight y_p with each page p. All x- and y-values are set
to a uniform constant initially. (The final results are essentially unaffected by
this initialization.) The authority and hub weights are updated as follows: If a
page is pointed to by many good hubs, we would like to increase its authority
weight. Thus, for a page p, the value of x_p is updated to be the sum of y_q over
all pages q that link to p:

$$x_p = \sum_{q:\, q \to p} y_q$$

where q → p indicates that q links to p. In the same manner, if a page points to many
good authorities, its hub weight is increased:

$$y_p = \sum_{q:\, p \to q} x_q$$
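
The weight-propagation step can be sketched as a short iteration. This is an illustrative sketch rather than a reference implementation: the link structure is assumed to be a dictionary from each page to the set of pages it links to, the fixed iteration count is arbitrary, and the normalization after each round is not described in the text above but follows Kleinberg's formulation of HITS to keep the weights bounded.

```python
import math


def hits(links, iterations=20):
    """Iterate the authority (x) and hub (y) updates over a link dictionary."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}  # uniform initial x-values
    hub = {p: 1.0 for p in pages}   # uniform initial y-values
    for _ in range(iterations):
        # x_p = sum of y_q over all pages q that link to p
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # y_p = sum of x_q over all pages q that p links to
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        # Normalization (an addition to the text above, see lead-in).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Toy example with hypothetical page names.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
auth, hub = hits(links)
print(sorted(auth, key=auth.get, reverse=True))
```

In the toy example, the page linked to by both hubs ends up with the larger authority weight, and the hub pointing to both authorities ends up with the larger hub weight.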

HITS is a simple algorithm based solely on hyperlink information, except for the
acquisition of a root set, and its behavior has been analyzed by several researchers.
HITS tends to generalize topics that are not sufficiently broad, which is called
topic generalization [5]. There are several works on distilling topics of Web
communities by using this phenomenon [1] [3].

Another point that should be mentioned is that HITS sometimes outputs hubs and
authorities which are irrelevant to the input topic. When a good hub page of a
community contains hyperlinks pointing to pages of several topics, pointed pages
irrelevant to the input topic may receive a large authority weight and be regarded
as authoritative pages of the community. Such a phenomenon is called topic drift [6].
Another phenomenon is inadvertent topic hijacking [4], which occurs when a base set
contains a number of Web pages from the same Web site. Since such pages often contain
hyperlinks pointing to the same URL (for example, the top page of the site), the
authority weight of irrelevant pages may be increased.

Several attempts have been made to avoid such phenomena, such as using 
anchor texts and giving weights to hyperlinks [3], and pruning irrelevant pages from the 
base set before calculating the authority/hub weights [1]. However, the 
fundamental cause of this undesirable behavior of the HITS algorithm is considered to lie 
in the generation of the base set. In the HITS algorithm, a base set is generated by collecting the 
neighboring pages of a root set, which is acquired from the results of a keyword-based 
search engine. The algorithm is based on the assumption that many good authority/hub 
pages are included in a base set generated in this way. However, 
this assumption is not always true. Since HITS restricts its attention to the pages of the 
base set in the process of ranking, its results depend heavily on the quality of 
the base set. On the other hand, Murata’s Web community discovery method [11] 
acquires data in the process of discovery. The goal of the method is to find a complete 
bipartite graph composed of centers (informative pages) and fans (pages 
containing hyperlinks to centers); data acquired from a search engine and Web 
servers are used to renew the members of centers and fans iteratively. Since the 
quality of the data can be improved by acquiring data in the process of discovery, the 
method is expected to avoid the above weakness of HITS. 

This paper proposes a new method for purifying Web communities, which is a 
modified version of the above method. The members of fans and centers are changed 
iteratively by a kind of majority vote of each other. In this manner, the members of the 
communities are purified so that representative fans and authorities are acquired. 



3 A Method for Purifying Web Communities 

The method for discovering Web communities [11], which is the basis of the new method 
in this paper, is explained first. The method consists of the following three steps: 

1. Search of fans using a search engine 

2. Addition of a new URL to centers 

3. Repetition of step 1 and step 2 

Figure 1 shows the steps of the community discovery method. In the method, some 
input URLs are accepted as initial centers, and fans that co-refer to all of the centers 
are searched for. As shown in the figure, fans are found from the centers by backlink 
search on a search engine. The next step is to add a new URL to the centers based on the 
hyperlinks included in the acquired fans. The fans’ HTML files are acquired through the 
Internet, and all the hyperlinks contained in the files are extracted. The hyperlinks are 
sorted in order of frequency. Since hyperlinks to related Web pages often 
co-occur, the top-ranking hyperlink is expected to point to a page whose contents are 
closely related to the centers. Therefore, the URL of that page is added as a new 
member of the centers. 
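
One round of this discovery procedure might be sketched in Python as follows. The backlink search on a search engine and the download of the fans' HTML files are replaced here by an in-memory dictionary `fan_links` (page URL to the set of URLs it links to); that substitution, the function name `next_center`, and the alphabetical tie-breaking rule are assumptions made for this illustration only.

```python
from collections import Counter


def next_center(centers, fan_links):
    """Return (fans, new_url) for one round of community discovery.

    centers   : list of URLs currently regarded as centers
    fan_links : dict mapping a candidate fan URL to the set of URLs it links to
                (stands in for backlink search plus HTML download)
    """
    center_set = set(centers)
    # Step 1: fans are the pages that link to all of the centers.
    fans = [f for f, out in fan_links.items() if center_set <= set(out)]
    # Step 2: count how often each non-center URL is linked to by the fans.
    counts = Counter()
    for f in fans:
        counts.update(set(fan_links[f]) - center_set)
    if not counts:
        return fans, None
    # The most frequent URL becomes the new center (ties broken alphabetically).
    new_url = min(counts, key=lambda u: (-counts[u], u))
    return fans, new_url
```

Step 3 of the method corresponds to appending `new_url` to `centers` and calling the function again.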

In the method for purifying Web communities, which is newly proposed in this paper, 
the above steps of renewing fans and centers are modified in the following way: 

• If there are few fans that co-refer to all the members of the centers, one of the members 
of the centers is randomly removed and the corresponding fans are then searched for by 
backlink search, so that the number of fans exceeds a certain threshold. 

• After all the hyperlinks contained in the fans’ HTML files are extracted, they are sorted 
in order of frequency. Then a few URLs of high-ranking hyperlinks are added 
to the centers, and the same number of low-ranking URLs that were members 
of the previous centers are removed from the centers. This means that the centers are 
updated according to the references of the corresponding fans. The number of 
additions/removals of URLs is limited to half of the number of centers so that the topic 
of the centers will not change drastically. 
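
The two modifications might be sketched as follows. As in any offline illustration, the backlink search is replaced by a dictionary `fan_links`; the names `find_fans` and `purify_step`, the `min_fans` threshold parameter, and the rule that a newcomer replaces an old center only when it is linked to by strictly more fans are assumptions of this sketch, not details given in the text.

```python
import random
from collections import Counter


def find_fans(centers, fan_links):
    """Pages that link to every URL in centers (stand-in for backlink search)."""
    wanted = set(centers)
    return [f for f, out in fan_links.items() if wanted <= set(out)]


def purify_step(centers, fan_links, min_fans=2, rng=random):
    """One purification round; returns (updated_centers, fans)."""
    # Modification 1: relax the query by dropping random centers
    # until enough fans are found.
    query = list(centers)
    fans = find_fans(query, fan_links)
    while len(fans) < min_fans and len(query) > 1:
        query.remove(rng.choice(query))
        fans = find_fans(query, fan_links)

    # Modification 2: rank all URLs referenced by the fans by frequency.
    counts = Counter()
    for f in fans:
        counts.update(set(fan_links[f]))
    ranked = sorted(counts, key=lambda u: (-counts[u], u))

    # At most half of the centers may be exchanged in one round.
    max_swaps = len(centers) // 2
    newcomers = [u for u in ranked if u not in centers][:max_swaps]
    weakest = sorted(centers, key=lambda u: (counts[u], u))

    updated = list(centers)
    for new, old in zip(newcomers, weakest):
        if counts[new] > counts[old]:  # assumption: swap only if it helps
            updated[updated.index(old)] = new
    return updated, fans
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.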



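[Figure: the input URLs become the initial centers; (1) fans are found by searching 
backlinks of the centers; (2) the most frequent URL among the fans' hyperlinks is added 
to the centers; (3) steps (1) and (2) are repeated.] 

Fig. 1. A method for discovering Web communities 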





With these modifications, the following effects are expected: 

• Even if some irrelevant pages are contained in the centers, the quality of the fans will not 
deteriorate, since pages that refer to most of the centers (not necessarily all of them) are 
searched for and regarded as fans. 

• Since the URLs that are linked to by many of the fans are considered to be representative 
of the topic of the Web community, replacing members of the centers with 
high-ranking URLs is expected to improve the quality of the centers. 



4 Experiments 

Based on the above method, the author has built a system for purifying Web 
communities. As the input to the system, the bottom five URLs listed for each topic of 
100hot.com (http://www.100hot.com/) are given. These URLs are regarded as the initial 
centers of a community, which is then purified by the system so that higher-ranking 
URLs will be collected as the members of the final centers. The average rankings of the centers 
for each topic before and after purification are shown in Table 1. 

This table shows that higher-ranking centers are acquired for some of the topics, 
such as Macintosh, Election, and Music. The reasons the system performs well for 
these topics are as follows: 

1. The topics of these communities are more focused than the others. In many cases, 
focused communities have representative pages that are referred to by most of the 
community members. 

2. The graphs of these communities are densely connected. This enables 
purification from hyperlink information alone. 



Table 1. Average ranking of centers for each topic 

Besides the topics for which our system performs well, there are other topics for which 
the system outputs rather unexpected results. For example, the inputs and outputs for 
the topic Magazine are as follows: 

• Inputs: chemweek.com, mysterynet.com, cosmomag.com, playbill.com, si.edu 

• Outputs: washingtonpost.com, nytimes.com, usatoday.com, latimes.com, wsj.com 

This result shows that the topic of the centers has shifted from Magazine to 
Newspaper, and it also shows the closeness of the communities of these two topics. Another 
example is the community for the topic Travel: 

• Inputs: smarterliving.com, sheraton.com, ebookers.com, qixo.com, hotel.com 

• Outputs: hilton.com, hyatt.com, sheraton.com, marriott.com, holiday-inn.com 

In general, when a target topic is so broad that it contains many subtopics, there are 
several representative pages for the topic. In this example, since many hotel sites are 
included in the input URLs, the topic of the community is narrowed to hotels in the 
process of purification. 

Both HITS and our method are based on the graph structures that are extracted lo- 
cally from the vast Web network. Since our method acquires Web data during the 
process of purification, and renews the members of communities iteratively, it is ex- 
pected that the method performs well even when the members of initial communities 
are not representative ones. 



5 Concluding Remarks 

This paper proposes a new method for purifying Web communities based on the graph 
structure of hyperlinks. Results of the system that is developed based on our method 
are also shown. The following points should be mentioned for “purifying” our future 
research targets: 

• The method proposed in this paper can be regarded as a method for searching for a 
complete bipartite subgraph included in a graph that corresponds to a community. 
Although the effectiveness of our method depends heavily on the graph structure of the 
target communities, the typical graph structure of Web communities is not clear. We 
have to study further which models of such structures fit actual Web 
communities well. 

• There is no standard test data set for evaluating Web mining systems. The 
above experiment was conducted on the assumption that the URLs listed for each topic of 
100hot.com are ranked in order of relevance to the topic. However, this 
assumption is not always true. In the ranking used for our experiment, the 
top-ranking URL for the topic car is Microsoft.com! In order to evaluate the performance 
of our system objectively, some kind of standard test data set for Web mining is 
really needed. 



Abstract. Genomic strings are not of fixed length, but provide 
one-dimensional spatial data that do not divide for conquering by machine 
learning into manageable fixed size chunks obeying Dietterich’s 
independent and identically distributed assumption. We nonetheless need to 
divide genomic strings for conquering by machine learning, in this case 
for genomic prediction. 

Orthologs are genomic strings derived from a common ancestor and 
having the same biological function. Ortholog detection is biologically 
interesting since it informs us about protein divergence through evolution, 
and, in the present context, also has important agricultural applications. 
The present paper indicates means to obtain an associated (fixed 
size) attribute vector for genomic string data and for dividing and 
conquering the machine learning problem of ortholog detection, herein seen 
as an analogy problem. The attributes are based on both the typical 
string similarity measures of bioinformatics and on a large number of 
differential metrics, many new to bioinformatics. Many of the differential 
metrics are based on evolutionary considerations, both theoretical 
and empirically observed, in some cases observed by the authors. 

C5.0 with AdaBoosting activated was employed, and the preliminary 
results reported herein on complete cDNA strings are very encouraging for 
eventually and usefully employing the techniques described for ortholog 
detection on the more readily available EST (incomplete) genomic data. 



K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 290-303, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 



1 Introduction 

Genomic strings are strings of one of two types: nucleotide strings and amino 
acid strings. Nucleotide strings are what genes are, and they code for amino acid 
strings, which are proteins. We can model each as strings of letters where the 
letters are standard names for the nucleotides or the amino acids. For machine 
learning¹ purposes, it is not practical to process genomic strings as fixed-size 
vectors (of letters). However, genomic strings can be thought of as one-dimensional 
spatial structures.² Dietterich [Die00] discusses in detail the problem for machine 
learning of employing divide and conquer on spatial and temporal data which 
can’t be practically completely represented as fixed-size vectors. Of course such 
data can be divided into manageable fixed size chunks. He notes, though, that 
divide and conquer is problematic if the data fails to satisfy the independent 
and identically distributed (iid) assumption. As we will see below, the problem 
discussed in this paper does not satisfy this assumption, and this paper 
provides, then, among other things, a case study of how in our problem domain we 
circumvent the difficulty. 

In GenBank (a major repository of genomic information) there are many 
human and mouse (mammal) genomic sequences with known associated functions; 
there are some but fewer (food animal) chicken sequences with known associated 
functions. Poultry is the third largest agricultural commodity, and the main meat 
consumed in the U.S.³ Control of disease in these birds is important for both 
agricultural economics and human health. The identification of candidate genes 
for disease resistance, or the development of immune enhancers to make vaccines 
more effective or even obsolete, are among the more contemporary approaches 
to disease control in this important food animal. However, gene sequence 
information for birds is currently too limited. Fortunately, as just noted above, some 
information is available, so there is some basis for training a machine learning 
procedure. 

Orthologs are (genomic) sequences which are from different species but which 
have common descent and the same function. Crucially, in a number of cases 
one can locate and compare human, mouse, and chicken orthologs. We’ve been 
concerned, then, with an analogy problem: find/exploit patterns in the known 
orthologs between human, mouse, and chicken and apply those patterns to human 
and mouse orthologs X, Y with known function, but whose chicken ortholog Z 
is unknown, to detect the unknown Z. 

To find patterns between relatively closely related species, e.g., human and 
mouse, it has sufficed to use known local-alignment-based similarity tools such 
as BLAST (and variants) [AGM+90,AMS+97,KA90,Pea95] which are based on 
string similarity only. They find “locally maximal segment pairs.” This similarity 

¹ Machine learning [Mit97,RN95] involves algorithmic techniques for fitting programs 
to data and for outputting the programs fit for subsequent use in predicting future 
data. A program so fit to data is said to be learned. 

² Amino acid sequences fold into 3-D structures, but that, for us, will be taken into 
account in future work. See Section 6 below. 

³ http://www.usda.gov/news/pubs/fbook98/chla.htm 

matching does not suffice for highly divergent orthologs (e.g., some of the 
orthologs between mammals and birds) since the regions of similarity are too 
fragmented. For example, Figure 1 depicts an optimal global amino acid sequence 
alignment between chicken and mouse IL-2 orthologs⁴ (with chicken shown on 
top). The corresponding nucleotide sequence alignment is also very fragmented 
(data not shown). The same degree of fragmentation is seen comparing chicken 
and human IL-2 (data not shown). When searching chicken IL-2 against 
GenBank, BLAST and variants do not and cannot find any locally maximal segment 
pairs in mammals which have statistical significance. This problem is not just for 
IL-2. More generally, it follows from [RYW+00] and recent news releases from 
Celera that more than 25% of orthologs are not identified by commonly used 
(local-alignment-based similarity) tools. 

[Figure: the chicken IL-2 amino acid sequence (top) aligned with the mouse IL-2 
sequence (bottom); vertical bars mark identical residues, which are sparse and 
scattered along the alignment.] 



Fig. 1. Optimal Global Amino Acid Alignment Between Chicken and Mouse IL-2 



In the analysis of analogy problems from both cognitive psychology [Ste88] 
and artificial intelligence [Eva68,RN95], we see that both similarities and differ- 
ences need to be taken into account. For example, here are a couple of string 
analogy problems from Hofstadter. These problems are based on alphabetical 
order, though, not genomics. 

abc -> abd, ijk -> ? 
abc -> abd, iijjkk -> ? 

We see that taking into account both string similarities and differences is a 
necessary part of solving these problems. 

Other projects have employed differential metrics to some degree and to 
good effect. The tools for intron-exon⁵ recognition (not what we are doing in the 
present study), GRAIL [GME+92] and GENSCAN [BK97], employ differential 
metrics (and there is a similarity metric implicit, for example, in the potential 
function in GRAIL). A codon is comprised of a contiguous triple of nucleotides from 

⁴ IL-2 is interleukin 2, an immune system protein. 

⁵ Exons contain the coding portions of genes. 

chick-human AA identity <= 25.54: no 
chick-human AA identity > 25.54: 
:...chick-mouse NA identity <= 49.5: 
    :...chick-human NA length/(# gaps) > 57.45: no 
    :   chick-human NA length/(# gaps) <= 57.45: 
    :   :...mouse A to chick C <= 19.09: no 
    :       mouse A to chick C > 19.09: yes 
    chick-mouse NA identity > 49.5: 
    :...chick-human NA length/(# gaps) <= 118.0588: 
        :...chick-mouse NA length/(# gaps) > 103.7143: no 
        :   chick-mouse NA length/(# gaps) <= 103.7143: 
        :   :...chick T to mouse C > 25.59: no 
        :       chick T to mouse C <= 25.59: 
        :       :...chick T to human G <= 13.39: yes 
        :           chick T to human G > 13.39: no 
        chick-human NA length/(# gaps) > 118.0588: [Rest omitted] 



Fig. 2. First Tree Output By C5.0 — With Portion Omitted 



{A, C, G, T}, and 61 of these triples each code for a single corresponding amino 
acid. Differential metrics can be based on so-called codon bias [SM82,SCH+88, 
Li97]. Most of the 20 amino acids are encoded by more than one codon; codon 
bias is, then, the quantifiable phenomenon that an organism uses one particular 
codon for an amino acid significantly more often than all the other synonymous 
codons. [SG94] provides an improvement of BLAST with a measure of codon 
bias as a differential metric. In the present project we employ codon bias as one 
class of differential metrics or attributes: we count, for each of the 61 codons, 
how many times it occurs in the orthologs. 
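
As a hedged illustration of this class of attributes, the following Python sketch counts the 61 amino-acid-coding codons in a coding sequence. It assumes the standard genetic code (so that TAA, TAG and TGA are the three excluded stop codons) and a reading frame starting at the first nucleotide; the names `SENSE_CODONS` and `codon_counts` are hypothetical.

```python
from itertools import product

# Assumption: standard genetic code, whose three stop codons are excluded.
STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [
    "".join(triple)
    for triple in product("ACGT", repeat=3)
    if "".join(triple) not in STOP_CODONS
]


def codon_counts(coding_sequence):
    """Count each of the 61 sense codons, reading in frame from position 0."""
    counts = dict.fromkeys(SENSE_CODONS, 0)
    seq = coding_sequence.upper()
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # stop codons and ambiguous triples are skipped
            counts[codon] += 1
    return counts
```

The resulting dictionary gives a fixed-length vector of 61 counts regardless of the length of the input string.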

In our project for (chicken) ortholog detection, we have devised a number 
of other differential metrics also to complement standard similarity metrics for 
genomic sequences. These measures of similarity and differences provide our 
attributes (or features) for machine learning and constitute, in many cases, a 
useful division into parts of the original problem about 1-D strings, a division 
towards conquering the problem. As noted above, this division yields cases where 
the iid assumption fails. Instead, the co-evolution of mammal and bird orthologs 
from common ancestor strings involves whole interdependent string patterns 
coming out partly differently and partly similarly. 



2 Attributes Based on Similarities and Differences 

We mentioned codon bias for differential metrics above. 

A straightforward evaluator of similarity is simple percent identity. Studies 
([Li97], Chapter 1) have shown with accompanying simple biochemical explanation 
that, when mutations occur, the nucleotides A and G tend to change to 
G and A, respectively, and C and T tend to change to T and C, respectively; 
these are called transitions. The other 8 substitutions, between the group of 
{A, G} and the group of {C, T}, are called transversions, and they occur less 
frequently than transitions. Insertions and deletions of nucleotides are thought 
to occur rarely; however, when they do occur, several adjacent nucleotides may 
be involved [BO98]. Therefore, another common way to evaluate the quality of 
an alignment is to assign a high score to identity matches, a medium score to 
transitions, a low score to transversions, a large penalty to opening a gap, and 
a small penalty to extending a gap. Our Table 1 is such a scoring scheme. 



Table 1. A Nucleotide Sequence Alignment Scoring Scheme. 



From\To    A    C    G    T 
A          4    1    2    1 
C          1    4    1    2 
G          2    1    4    1 
T          1    2    1    4 

Gap Opening      -5 
Gap Extension    -2 



Amino acid sequences are what cells translate nucleotide (gene) sequences 
into (to form proteins). When amino acid sequences are aligned, the scoring 
matrix is a 20 by 20 table because there are 20 amino acids; some commonly 
used matrices for amino acid sequence alignment include the PAM and BLOSUM 
families of matrices (see [BO98] and the references therein). 

The Needleman-Wunsch algorithm [NW70] finds an optimal global align- 
ment of two sequences. Optimal global alignment has thus far been mostly used 
in comprehensive studies of orthologs, as in [MB98], where orthology has al- 
ready been established, and researchers want to extract additional information 
from the aligned sequences. Global alignment involves some increased complexity 
costs over local alignment schemes, but we’ve seen, for our applications reported 
herein, that this increased cost is not prohibitive; furthermore, we have begun 
using the more efficient variant of Needleman-Wunsch from [Got82]. When we 
apply (the improved variant of) Needleman-Wunsch to obtain global alignment 
values for similarity attributes, for nucleotide alignment we apply the scoring 
scheme in Table 1, and, for amino acid alignment, we apply the scoring scheme 
from the PAM250 matrix. Needleman-Wunsch and its improvement calculate a 
global alignment optimal in the sense that no other alignment yields a higher 
score, global in the sense that the entire lengths of the sequences are taken into 
consideration. 
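
To make the notion of an optimal global alignment score concrete, here is a minimal Python sketch for nucleotide strings using the Table 1 scores. It uses the standard three-matrix dynamic program for affine gap penalties, which is usually attributed to Gotoh; it is not claimed to be the implementation used in this project. The convention that a gap of length k costs one opening penalty plus (k - 1) extension penalties, and the function names, are assumptions of the sketch.

```python
NEG_INF = float("-inf")
PURINES = {"A", "G"}


def substitution_score(a, b):
    """Table 1: identity 4, transition 2, transversion 1."""
    if a == b:
        return 4
    if (a in PURINES) == (b in PURINES):
        return 2  # A<->G or C<->T
    return 1


def global_alignment_score(s, t, gap_open=-5, gap_extend=-2):
    """Best global alignment score of nucleotide strings s and t."""
    n, m = len(s), len(t)
    # M: last column aligns s[i-1] with t[j-1]
    # X: last column aligns s[i-1] with a gap
    # Y: last column aligns a gap with t[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = substitution_score(s[i - 1], t[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

Recovering the alignment itself (and hence percent identity and gap statistics) would additionally require a traceback through the three matrices, which is omitted here.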

For our similarity attributes we employ both the Nucleotide Alignment (NA) 
scores and the Amino acid Alignment (AA) scores, comparing chicken with 
each of mouse and human.⁶ These scores are given as percent identities. 

⁶ Applying attribute values for both chicken-mouse and chicken-human comparisons 
improves performance over just employing comparisons between chicken and one of 
these mammals. 

We noticed an intriguing and biologically significant empirical pattern 
comparing NA and AA for our current full data set of 213 complete orthologs between 
chicken-mouse-human. In Table 2 this pattern (among other things) is displayed 
for a representative sample of 20 of our orthologs. Table 2 is shown sorted in the 
column of chicken-mouse nucleotide alignment. For the top portion of the table 
(as sorted), chicken-mouse NA percent identity is smaller than that of AA, but 
for the bottom portion of the table, the ordering of the two numbers becomes 
reversed. The likely biological/biochemical explanation appears to be: in the top 
portion we see the effects of mutations in third and redundant position in codons 
[Li97]; in the bottom portion we see critically preserved amino acids; and in the 
middle some combination of each. We have employed, then, the values of 
(NA-AA) and NA/AA as attributes measuring the degree to which nucleotide and 
amino acid alignments differ. 

From the NA and AA alignments themselves we calculate lengths, numbers 
of gaps, and their average lengths. We then combine these numbers 
in various ratios to provide differential attributes. The present paper reports 
on our progress with ortholog prediction for complete cDNA sequences. In the 
future we plan to apply our methods to ESTs (incomplete sequence data); 
making these attributes ratios is one way to ensure that, on average, the incompleteness 
of the ESTs will not bias our attributes compared with their values on complete 
sequences. 

We also employ as attributes the percentages of conservations of the four 
nucleotides, the percentages of transitions, and the percentages of transversions. 
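
A possible way to compute such percentages from the two rows of a nucleotide alignment is sketched below. The choice of denominator (the number of ungapped columns), the treatment of gaps and ambiguous characters (skipped), and the function name are assumptions made for this illustration.

```python
PURINES = {"A", "G"}


def substitution_percentages(row_a, row_b):
    """Percentages of conserved, transition and transversion columns.

    row_a and row_b are the two rows of one alignment ('-' marks a gap).
    Percentages are taken over ungapped columns (an assumption of this sketch).
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    conserved = dict.fromkeys("ACGT", 0)
    transitions = transversions = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous symbols
        if a == b:
            conserved[a] += 1
        elif (a in PURINES) == (b in PURINES):
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1
    total = sum(conserved.values()) + transitions + transversions
    if total == 0:
        raise ValueError("no ungapped columns to compare")
    return {
        "conserved": {n: 100.0 * c / total for n, c in conserved.items()},
        "transitions": 100.0 * transitions / total,
        "transversions": 100.0 * transversions / total,
    }
```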

From above, transversions are those nucleotide mutations (e.g., from C or T 
to A) that are less likely to occur biochemically. Table 2 also displays, for the 
20 representative orthologs (out of our 213) transversion bias percents between 
mouse and chicken. We’ve based a number of additional attributes on various 
measures suggested by the biologically quite interesting transversion bias trends 
seen in this table and in the table of all our 213 orthologs. E.g., we have various 
useful attributes measuring deviation from the boldfaced columns for transver- 
sion bias. 

We illustrate these attributes with the example of a particular chicken 
sequence compared to its mouse ortholog. For such a sequence comparison 
(corresponding to the first four columns of a single row of a table like Table 2) there 
are four transversion percentages: 

$(t_1, t_2, t_3, t_4) = (\%\ \text{of}\ \{C,T\} \to A,\ \%\ \text{of}\ \{A,G\} \to C,\ \%\ \text{of}\ \{C,T\} \to G,\ \%\ \text{of}\ \{A,G\} \to T)$. (1) 

We treat these four numbers as coordinates of a four-dimensional point. The 
general pattern (quite similar to that of the boldface pattern in Table 2): for 
transversions from chicken to mouse, a point will usually have a larger/largest 
second coordinate than its other coordinates; hence, the points will reside in a 
restricted sub-region in the space. Since the distribution of these points is not 
known, we could use the distribution-free, scaling and rotation invariant measure 
called simplicial depth [BF84,Liu90,LS93,CO01] to measure how near a point is 
to the center of the cluster of points. We have experimented, to good effect 

Table 2. Transversion Bias and Comparative Alignments 

Protein                           Chicken to mouse          Mouse to chicken        % identity 
From                              CT    AG    CT    AG      CT    AG    CT    AG 
To                                 A     C     G     T       A     C     G     T      NA    AA 
frizzled 7                       2.2   8.2   6.9   2.9     1.7   8.6   7.7   2.0    81.8  87.4 
transforming growth factor β 3   6.0   5.7   4.0   1.9     2.9   6.7   5.3   2.6    81.1  87.1 
nicotinic Ach receptor α 1       3.1   6.6   6.1   2.5     5.1   6.5   3.2   3.5    79.4  84.3 
growth hormone                   4.0   9.1   6.4   5.0     5.0   7.1   8.3   3.9    74.8  73.1 
VEGF                             5.9   8.2   3.5   1.6     5.7   3.9   6.1   3.9    74.7  73.4 
PDGF receptor α                  4.7   7.0   5.8   3.2     7.0   3.9   4.9   5.2    74.3  79.3 
estrogen receptor                3.2   9.7   6.1   3.3     8.7   4.2   4.7   4.8    73.9  78.3 
PDGF α                           5.7   8.2   6.1   2.4     7.9   4.0   5.2   5.6    72.2  76.7 
FSH receptor                     4.9   7.9   4.9   6.1     8.1   5.8   4.4   5.2    71.5  71.6 
fibroblast growth factor 2       4.9   9.7   8.8   5.5     8.0   5.3   9.0   7.0    70.0  66.0 
thyrotropin β                    5.2   8.2   6.2   8.2     7.8   4.8   6.9   8.0    69.8  65.4 
growth hormone receptor          9.2   7.0   5.4   7.1     9.8   6.0   6.0   7.0    66.6  56.9 
insulin-like growth factor I     6.6  11.4   6.6   5.0    11.5   6.2   5.8   6.2    64.1  62.9 
prolactin                        9.5   9.6   7.2   8.7    10.0   8.1   7.4   9.3    62.2  50.8 
β 2 microglobulin               17.8  18.3  11.7  11.7     7.7  22.9  22.1   6.7    54.7  42.9 
prolactin receptor               8.8  10.1   6.7   8.5    14.1   6.6   7.7   6.5    54.6  42.8 
interleukin 1 β                 18.8  11.6  11.2  13.2    11.6  21.0  14.2   7.8    51.3  31.7 
interleukin 18                  14.6  13.3  11.7  11.3    15.1  10.4  14.3  11.4    51.2  31.8 
interleukin 15                  10.8  14.8   8.8  13.7    23.3   6.6   9.8   9.9    49.6  33.8 
interleukin 2                   19.3  11.7  13.9  16.7    24.9   9.0  10.4  17.5    42.5  19.9 

Left: Transversion bias %s (in the original, the largest number in each row is boldfaced). 
Right: Nucleotide (NA) vs. Amino acid (AA) sequence alignments as % identities. 
Rows sorted by NA. 


(Section 5), with easily computed, one-dimensional projections of the full, more 
difficult to compute simplicial depth: we see not only that $t_2$ tends to be the 
largest of the four, but, when $t_2$ is not the largest, that $t_1$ tends to be. We use as 
one-dimensional projections the following formulas for additional differential 
attributes: $t_2/\min(t_1, t_2, t_3, t_4)$ and $t_1/\min(t_1, t_2, t_3, t_4)$. The first we call 
a major transversion bias, the second a minor. Similar (but not the same) 
formulas are used for the transversion biases from mouse to chicken, and for those 
between human and chicken. Relatively large values in these differential measures 
indicate conformity to the typical transversion bias patterns. 
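
For the chicken-to-mouse direction, the two ratios can be computed as in the following sketch. The function name and the decision to raise an error when the smallest percentage is zero are assumptions; how a zero denominator should be handled is not stated in the text.

```python
def transversion_biases(t1, t2, t3, t4):
    """Return (major, minor) transversion bias for chicken-to-mouse percentages.

    major = t2 / min(t1, t2, t3, t4)
    minor = t1 / min(t1, t2, t3, t4)
    """
    smallest = min(t1, t2, t3, t4)
    if smallest <= 0:
        # Assumption of this sketch: a zero percentage is treated as an error.
        raise ValueError("all four transversion percentages must be positive")
    return t2 / smallest, t1 / smallest
```

Using the chicken-to-mouse percentages listed for frizzled 7 in Table 2, (2.2, 8.2, 6.9, 2.9), the major bias is 8.2/2.2 (about 3.7) and the minor bias is 1.0.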

Lastly we employ some simple protein class information [AKF+95]⁷ (see also 
[TSB00]) for attributes. 



⁷ http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_report.spl 



3 How We Obtain Negative Data for Classification 

For the classification of genomic sequences as orthologous or not we want to 
supply for training data both positive and negative instances. 

Our positive data come from our 213 known orthologs. 

We employ two groups of negative data. The first group is of the form 

( human protein Y, mouse protein Y, chicken protein X ), (2) 



where 

- X and Y are in our collection, 

- X and Y are not orthologs, and 

- the two differences in length between the chicken protein and each mammalian protein are 
each less than 30% of the length of the mammalian protein, and at least one of 
the amino acid global alignment identities between chick X and human Y 
or between chick X and mouse Y is greater than or equal to 13%. (The 30% 
and 13% figures may be adjusted in the future as appears necessary.) 

For our 213 orthologs, there are 1043 data points in the first group. This first 
group corresponds to the type of negative data points on which we would want 
to test a decision program output by a machine learning technique. The second 
group is of the two forms 

( human protein X, mouse protein Y, chicken protein X ) (3) 

and 

( human protein Y, mouse protein X, chicken protein X ), (4) 

and the constraints on the proteins are the same as in the first group. For the 
213 orthologs, there are 2086 data points in the second group. The use of this 
second group considerably improves performance of decision programs output 
by the machine learning technique described in the next section. 
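
The construction of both groups can be sketched as follows. The data layout (a dictionary from protein name to its human, mouse and chicken amino acid sequences), the `identity` callable that stands in for a global-alignment percent identity, and the function name are assumptions of this illustration. Each qualifying ordered pair (X, Y) yields one triple of form (2) and one each of forms (3) and (4), which is consistent with the second group being twice the size of the first.

```python
from itertools import permutations


def negative_triples(orthologs, identity, max_len_diff=0.30, min_identity=13.0):
    """Build the two groups of negative (human, mouse, chicken) name triples.

    orthologs : dict name -> {"human": seq, "mouse": seq, "chicken": seq}
    identity  : callable(seq_a, seq_b) -> percent identity (hypothetical helper,
                e.g. derived from a global amino acid alignment)
    """
    group1, group2 = [], []
    for x, y in permutations(orthologs, 2):  # ordered pairs with x != y
        chick = orthologs[x]["chicken"]
        mammals = (orthologs[y]["human"], orthologs[y]["mouse"])
        lengths_ok = all(
            abs(len(chick) - len(m)) < max_len_diff * len(m) for m in mammals
        )
        similar = max(identity(chick, m) for m in mammals) >= min_identity
        if lengths_ok and similar:
            group1.append((y, y, x))  # form (2)
            group2.append((x, y, x))  # form (3)
            group2.append((y, x, x))  # form (4)
    return group1, group2
```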

4 Machine Learning Techniques Employed 

We employ as our machine learning technique Quinlan’s C5.0 which combines 
his C4.5 for decision tree induction [Qui93,RN95,Mit97] with the option for 
AdaBoosting [FS97,FS99]. Decision tree induction involves the fitting of simple 
decision trees with unary-predicate tests to classification data. C5.0 (and C4.5) 
employ an information-theoretic heuristic so that decisions at the top of a fitted tree 
explain more data than decisions further down. This provides both efficiency 
in fitting and some readability of the resultant trees for insight; these are the reasons the 
decision tree induction component was chosen for the project. AdaBoost is an 
important technique for improving learners both for fitting training data [FS96] 
and for generalization and prediction beyond the training data [FSBL98] (see 
also [FMS01]). It also handles well the presence of errors in the training data 
[FS96]. AdaBoosting, as employed in C5.0, takes a weighted majority vote of the 
decisions of a sequence of decision trees, where each tree, beyond the first, 
judiciously concentrates on the cases difficult for its predecessor.⁸ Since AdaBoosting 
combines a number of decision trees, its use may involve some tolerable loss of 
readability and efficiency. However, AdaBoosting nonetheless looks like linear 
(i.e., fast) programming [FS99]. The features of AdaBoosting just mentioned are 
why it was chosen for the project. 
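
The weighted majority vote can be illustrated with a few lines of Python. Each tree is modeled as a callable returning a class label; the weight formula ln((1 - error)/error) is a common AdaBoost choice and is used here only as an assumption, since the exact weighting inside C5.0 is not described in this paper. The function names are hypothetical.

```python
from math import log


def tree_weight(error):
    """Voting weight for a tree with training error in (0, 1).

    Assumption: the AdaBoost-style weight ln((1 - error) / error), which grows
    as the tree becomes more accurate.
    """
    if not 0.0 < error < 1.0:
        raise ValueError("error must lie strictly between 0 and 1")
    return log((1.0 - error) / error)


def weighted_vote(trees, weights, example):
    """Return the label with the largest total weight among the trees' votes."""
    totals = {}
    for tree, weight in zip(trees, weights):
        label = tree(example)
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)
```

With one accurate tree voting "yes" and two weaker trees voting "no", the accurate tree can outvote the other two, which an unweighted majority would not allow.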

Other methods might have been chosen. Reported in [MST94] is a major 
series of studies and domains comparing machine learning techniques (including 
decision tree induction and neural net learning) and classical statistical tech- 
niques. Decision tree induction was generally robust over the domains studied 
including compared to statistics. Again, though, it had the advantage that it’s 
products are readable for insight. Of course, each technique compared had its 
especially good domains. In [BB98], for example, we see many bioinformatics 
problems tackled with either neural nets or statistical techniques (but not de- 
cision tree induction with AdaBoosting). Neural nets and statistical methods 
tend not to produce classification programs readable for insight. We do note 
that the Morgan system in [SDFH98] does employ decision tree induction — to 
simplify otherwise complex dynamic programming for doing similarity matching 
for intron-exon recognition in vertebrates.^9 We also see that [AMS+93] employs
a decision tree induction which automatically selects string patterns from a given 
table and produces a decision program which tests input data against the table 
to predict transmembrane domains from protein data. Support vector machines 
[Vap95,Vap98] and neural nets can, in effect, cut up the attribute space in ways 
that decision trees do not. For example, in some cases, there can be advantages 
in decision tree induction to suitably rotate the attribute space; however, Ad- 
aBoosting (more than) makes up for any such advantage [Qui97], and, in effect, 
cuts up the attribute space very finely [Qui98]. Furthermore, support vector 
machines involve quadratic (i.e., slower) programming [FS99]. 



5 Results 

When we run C5.0 with AdaBoost activated on our 213 orthologs (and associated
negative data) we get ensembles of decision trees with an average of about
35 decision nodes per tree. These trees are humanly readable. The attributes
tested in ensembles of trees based on all 213 orthologs involve most of our current
attributes. The decisions made by such an ensemble with only three trees^10
make no mistakes on all the positive and negative data points generated by the
213 orthologs. More importantly, though, we employed 10-fold cross-validation
(i.e., a random 10th of the data is removed from training and employed instead
for testing) with 10 repetitions and obtained, with a boosting ensemble size
of 25 trees, a low error rate of 2.4% (with Standard Error less than 0.05%) on
the entire data set for all 213 orthologs. Furthermore, for each of the 213 ways
of removing one ortholog of the 213, we also tried training on the remaining 212
(with their associated negative data points) and testing the ensembles obtained
from C5.0 with AdaBoost activated on the missing ortholog and the (also missing)
negative data points associated with it. In 95% of the cases that the ortholog
omitted from the training data was chosen from the important protein class of
cell/organism defense (which includes the immune system enhancers we are especially
interested in)^11, ensembles with only four trees performed perfectly on
all the positive and negative cases including those for the ortholog omitted.

^8 Importantly, the voting weights are bigger for more accurate trees in the sequence
of trees.

^9 In the present project we are working only with exons or portions thereof.

^10 Recall from Section 4 above that the ensemble of trees obtained from AdaBoosting
makes its decisions by a judiciously weighted majority vote among the decisions of
its constituent trees, which is even more usefully subtle decision making than that of any
single tree.
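
The two evaluation protocols used in this section (repeated 10-fold cross-validation, and holding out one ortholog together with its associated negative data) can be sketched in Python as follows. This is only a minimal illustration under assumed names: `train_fn`, the majority-class learner and the group labels are hypothetical stand-ins, not the C5.0 runs reported above.

```python
import random


def k_fold_indices(n_items, k, seed):
    """Shuffle indices 0..n_items-1 and deal them into k near-equal folds."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_validate(items, labels, k, train_fn, seed=0):
    """Return the overall error rate when each fold is held out once."""
    folds = k_fold_indices(len(items), k, seed)
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(items)) if i not in held]
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        errors += sum(model(items[i]) != labels[i] for i in held_out)
    return errors / len(items)


def leave_one_group_out(groups):
    """Yield (group, train_indices, test_indices) for every distinct group.

    A group here plays the role of one ortholog plus its negative data points.
    """
    for g in sorted(set(groups)):
        test = [i for i, gi in enumerate(groups) if gi == g]
        train = [i for i, gi in enumerate(groups) if gi != g]
        yield g, train, test


def train_majority(train_items, train_labels):
    """Hypothetical placeholder learner: always predict the majority label."""
    majority = max(set(train_labels), key=train_labels.count)
    return lambda item: majority


items = list(range(30))
labels = [1] * 27 + [0] * 3
# Ten repetitions with different shuffles, as in a repeated cross-validation.
rates = [cross_validate(items, labels, 10, train_majority, seed=s) for s in range(10)]
print(sum(rates) / len(rates))
```

Grouping by ortholog matters because the negative data points are derived from the same ortholog; splitting them across training and test sets would leak information about the held-out case.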

On our 213 orthologs and associated negative data the first decision tree
produced by C5.0 (with a portion omitted to save space) is shown in Figure 2.
The tree should be read essentially as an if-then-else program with nesting
indicated by indentation. The decision yes is for orthologous and no is for
non-orthologous. From vertical position in the tree we see, for example, that the top
test of an amino acid percent identity, chick-human AA identity <= 25.54,
explains more data than the test somewhat below of a transversion percent,
chick T to human G <= 13.39. In the omitted portion there appears, among
other tests, the test chick-mouse transversion bias (minor) > 2.292322.

A conclusion of all these results is that, at least for complete cDNA sequences, 
with only our current attribute set, we can apparently explain or cover most of 
the causes behind orthology between the chosen bird and mammal species. Our 
paper also presents a particularly successful application of the machine learning 
method chosen (C5.0). 



6 Future Work 

In the future we need to expand our search for detectable but currently unknown 
chicken orthologs in large and rich databases of chicken ESTs. GenBank and, 
importantly, our own expanding database of now over 17,000 expressed chicken 
ESTs^12 will be crucial sources. Based on the results of the previous section we
are quite optimistic that, when we train on both complete and EST data in the 
interest of ortholog detection also for ESTs, we can be successful. 

A number of further enhancements of the machine learning techniques 
are also planned, including learning motif (i.e., pattern) information (as in 
[AMS+OO]) and employment of further machine learning modules such as FOCL 
[PBS91,Mit97] to enable better use of explicit background information from em- 
pirical and theoretical evolutionary considerations (such background information 
is implicit in some of the above). Multi-tasking is learning more than one thing
for the purpose of mutual enhancement of the learning. It has been shown helpful empirically
[SR86,PMK91,Car93,MCF+94,DHB95,Car96,MK96,TS96,BGN97] and
theoretically [Ash60,AGS89,KSVW93,KS94,CJO+00]. In the future we hope to
enhance our ortholog prediction by multi-tasking it with the learning of protein
classes. We also plan to try as additional attributes global alignment metrics
on the strings coding folding structure that are output by nnpredict [KCL90].^13
This algorithm produces good but approximate predictions [KCL90], so it will
be interesting to see if its predictions help or hinder ours.

^11 http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_report.spl

^12 http://www.chickest.udel.edu

We expect the work preliminarily described herein will help significantly with 
speeding up the discovery of useful new orthologies and also provide general in- 
sights regarding the evolutionary divergence between distantly related species 
(e.g., bird and mammal species). We anticipate agriculturally important appli- 
cations, e.g., it may lead to a reduction in the use of antibiotics in poultry. We 
also plan to apply our techniques to other species, e.g., zebrafish, fugu and frog. 



Acknowledgement. Ming Ouyang was partially supported by a postdoctoral
fellowship of Delaware Biotechnology Institute and by the USEPA funded Center
for Exposure and Risk Modeling (GR827033).



References 



[AGM+90] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and
    David J. Lipman. Basic local alignment search tool. J. Mol. Biol.,
    215:403-410, 1990.

[AGS89] D. Angluin, W. Gasarch, and C. Smith. Training sequences. Theoretical
    Computer Science, 66(3):255-272, 1989.

[AKF+95] M.D. Adams, A.R. Kerlavage, R.D. Fleischmann, R.A. Fuldner, C.J. Bult,
    N.H. Lee, E.F. Kirkness, K.G. Weinstock, J.D. Gocayne, O. White, et al.
    Initial assessment of human gene diversity and expression patterns
    based upon 83 million nucleotides of cDNA sequence. Nature, 377:3-174,
    1995.

[AMS+93] S. Arikawa, S. Miyano, A. Shinohara, S. Kuhara, Y. Mukouchi, and
    T. Shinohara. A machine discovery from amino-acid-sequences by decision trees
    over regular patterns. New Generation Computing, 11:361-375, 1993.

[AMS+97] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui
    Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST
    and PSI-BLAST: A new generation of protein database search programs.
    Nucleic Acids Research, 25(17):3389-3402, 1997.

[Ash60] R. Ashby. Design for a Brain: The Origin of Adaptive Behavior. Wiley,
    NY, second edition, 1960.

[BB98] P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach.
    MIT Press, Cambridge, MA, third edition, 1998.

[BF84] E. Boros and Z. Füredi. Triangles covering the centre of an n-set.
    Geometriae Dedicata, 17:69-77, 1984.



^13 http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html and
http://www.cmpharm.ucsf.edu/~nomi/nnpredict-instrucs.html




Divide and Conquer Machine Learning for a Genomics Analogy Problem 301 



[BGN97] Kai Bartlmae, Steffen Gutjahr, and Gholamreza Nakhaeizadeh. Incorporating
    prior knowledge about financial markets through neural multitask
    learning. In Proceedings of the Fifth International Conference on Neural
    Networks in the Capital Markets, 1997.

[BK97] C. Burge and S. Karlin. Prediction of complete gene structures in human
    genomic DNA. J. Mol. Biol., 268:78-94, 1997.

[BO98] Andreas D. Baxevanis and B.F. Francis Ouellette, editors. Bioinformatics:
    A Practical Guide to the Analysis of Genes and Proteins. John Wiley &
    Sons, Inc., 1998.

[Car93] Richard A. Caruana. Multitask connectionist learning. In Proceedings of
    the 1993 Connectionist Models Summer School, pages 372-379, 1993.

[Car96] R. Caruana. Algorithms and applications for multitask learning. In
    Proceedings of the Thirteenth International Conference on Machine Learning
    (ICML-96), pages 87-95. Morgan Kaufmann, San Francisco, CA, 1996.

[CJO+00] J. Case, S. Jain, M. Ott, A. Sharma, and F. Stephan. Robust learning
    aided by context. Journal of Computer and System Sciences (Special Issue
    for COLT'98), 60:234-257, 2000.

[CO01] Andrew Y. Cheng and Ming Ouyang. On algorithms for simplicial depth.
    In 13th Canadian Conference on Computational Geometry, pages 53-56.
    University of Waterloo, August 13-15 2001.

[DHB95] Thomas G. Dietterich, Hermann Hild, and Ghulum Bakiri. A comparison
    of ID3 and backpropagation for English text-to-speech mapping. Machine
    Learning, 18(1):51-80, 1995.

[Die00] T. Dietterich. The divide-and-conquer manifesto. In Proceedings of The
    11th International Workshop on Algorithmic Learning Theory (ALT'00),
    Lecture Notes in Artificial Intelligence, pages 13-26. Springer-Verlag,
    Berlin, 2000.

[Eva68] T. Evans. A program for the solution of a class of geometric-analogy
    intelligence-test questions. In M. Minsky, editor, Semantic Information
    Processing, pages 271-353. MIT Press, 1968.

[FMS01] Y. Freund, Y. Mansour, and R. Schapire. Why averaging classifiers can
    protect against overfitting. In Proceedings of the Eighth International
    Workshop on Artificial Intelligence and Statistics, 2001.

[FS96] Y. Freund and R. Schapire. Experiments with a new boosting algorithm.
    In Proceedings of the Thirteenth International Conference on Machine
    Learning (ICML-96), pages 148-156. Morgan Kaufmann, San Francisco,
    CA, 1996.

[FS97] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line
    learning and an application to boosting. Journal of Computer and System
    Sciences, 55:119-139, 1997.

[FS99] Y. Freund and R. Schapire. A short introduction to boosting. Journal
    of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999. In
    Japanese and translated by Naoki Abe; English version at
    http://www.research.att.com/~schapire/cgi-bin/uncompress-papers/FreundSc99.ps.

[FSBL98] Y. Freund, R. Schapire, P. Bartlett, and W. Lee. Boosting the margin: A
    new explanation for the effectiveness of voting methods. The Annals of
    Statistics, 26(5):1651-1686, 1998.

[GMF+92] X. Guan, R.J. Mural, J.R. Einstein, R.C. Mann, and E.C. Uberbacher.
    GRAIL: An integrated artificial intelligence system for gene recognition
    and interpretation. In Eighth IEEE Conference on AI Applications, pages
    9-13, Monterey, CA, March 2-6 1992. IEEE Computer Society Press.




302 M. Ouyang, J. Case, and J. Burnside 



[Got82] O. Gotoh. An improved algorithm for matching biological sequences.
    J. Mol. Biol., 162:705-708, 1982.

[KA90] Samuel Karlin and Stephen F. Altschul. Methods for assessing the
    statistical significance of molecular sequence features by using general scoring
    schemes. Proc. Natl. Acad. Sci. USA, 87:2264-2268, 1990.

[KCL90] D.G. Kneller, F.E. Cohen, and R. Langridge. Improvements in protein
    secondary structure prediction by an enhanced neural network. Journal
    of Molecular Biology, 214:171-182, 1990.

[KS94] M. Kummer and F. Stephan. Inclusion problems in parallel learning and
    games. In Proceedings of the Workshop on Computational Learning Theory,
    pages 287-298. ACM Press, NY, July 1994. Journal version to appear.
    Journal of Computer and System Sciences (Special Issue for COLT'94),
    52(3):403-420, 1996.

[KSVW93] E. Kinber, C. Smith, M. Velauthapillai, and R. Wiehagen. On learning
    multiple concepts in parallel. In Proceedings of the Workshop on
    Computational Learning Theory, pages 175-181. ACM, NY, 1993.

[Li97] Wen-Hsiung Li. Molecular Evolution. Sinauer Associates, Inc., 1997.

[Liu90] R.Y. Liu. On a notion of data depth based on random simplices. The
    Annals of Statistics, pages 405-414, 1990.

[LS93] R.Y. Liu and K. Singh. A quality index based on data depth and
    multivariate rank tests. Journal of American Statistical Association, 88:252-260,
    1993.

[MB98] Wojciech Makalowski and Mark S. Boguski. Evolutionary parameters
    of the transcribed mammalian genome: An analysis of 2,820 orthologous
    rodent and human sequences. Proc. Natl. Acad. Sci. USA, 95:9407-9412,
    1998.

[MCF+94] T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski.
    Experience with a learning, personal assistant. Communications of the
    ACM, 37:80-91, 1994.

[Mit97] T. Mitchell. Machine Learning. McGraw Hill, 1997.

[MK96] S. Matwin and M. Kubat. The role of context in concept learning. In
    M. Kubat and G. Widmer, editors, Proceedings of the ICML-96
    Pre-Conference Workshop on Learning in Context-Sensitive Domains, Bari,
    Italy, pages 1-5, 1996.

[MST94] D. Michie, D. Spiegelhalter, and C. Taylor, editors. Machine Learning,
    Neural and Statistical Classification. Ellis Horwood, NY, 1994.

[NW70] Saul B. Needleman and Christian D. Wunsch. A general method applicable
    to the search for similarities in the amino acid sequence of two proteins.
    J. Mol. Biol., 48:443-453, 1970.

[PBS91] M.J. Pazzani, C.A. Brunk, and G. Silverstein. A knowledge-intensive
    approach to learning relational concepts. In L. Birnbaum and G. Collins,
    editors, Proceedings of the 8th International Workshop on Machine
    Learning, pages 432-436. Morgan Kaufmann, 1991.

[Pea95] William R. Pearson. Comparison of methods for searching protein
    sequence databases. Protein Science, 4:1145-1160, 1995.

[PMK91] L. Pratt, J. Mostow, and C. Kamm. Direct transfer of learned information
    among neural networks. In Proceedings of the 9th National Conference on
    Artificial Intelligence (AAAI-91), 1991.

[Qui93] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann
    Publishers, San Mateo, CA, 1993.

[Qui97] J.R. Quinlan, 1997. Private communication.







[Qui98] R. Quinlan. Miniboosting decision trees. Journal of AI Research, 1998.

[RN95] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach.
    Prentice-Hall, NJ, 1995.

[RYW+00] Gerald M. Rubin, Mark D. Yandell, Jennifer R. Wortman, George L. Gabor
    Miklos, Catherine R. Nelson, Iswar K. Hariharan, Mark E. Fortini,
    Peter W. Li, Rolf Apweiler, Wolfgang Fleischmann, J. Michael Cherry,
    Steven Henikoff, Marain P. Skupski, Sima Misra, Michael Ashburner,
    Ewan Birney, Mark S. Boguski, Thomas Brody, Peter Brokstein,
    Susan E. Celniker, Stephen A. Chervitz, David Coates, Anibal Cravchik,
    Andrei Gabrielian, Richard F. Falle, William M. Gelbart, Reed A. George,
    Lawrence S.B. Goldstein, Fangcheng Gong, Ping Guan, Nomi L. Harris,
    Bruce A. Hay, Roger A. Hoskins, Jiayin Li, Zhenya Li, Richard O. Hynes,
    S.J.M. Jones, Peter M. Kuehl, Bruno Lemaitre, J. Troy Littleton,
    Debrah K. Morrison, Chris Mungall, Patrick H. O'Farrell, Oxana K. Pickeral,
    Chris Shue, Leslie B. Vosshall, Jiong Zhang, Qi Zhao, Xiangqun H. Zheng,
    Fei Zhong, Wenyan Zhong, Richard Gibbs, J. Craig Venter, Mark D.
    Adams, and Suzanna Lewis. Comparative genomics of the eukaryotes.
    Science, 287:2204-2215, 2000.

[SCH+88] Paul M. Sharp, Elizabeth Cowe, Desmond G. Higgins, Denis G. Shields,
    Kenneth H. Wolfe, and Frank Wright. Codon usage patterns in
    Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae,
    Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens: a review
    of the considerable within-species diversity. Nucleic Acids Research,
    16(17):8207-8211, 1988.

[SDFH98] Steven Salzberg, Arthur L. Delcher, Kenneth H. Fasman, and John
    Henderson. A decision tree system for finding genes in DNA. Journal of
    Computational Biology, 5(4):667-680, 1998.

[SG94] David J. States and Warren Gish. Combined use of sequence similarity
    and codon bias for coding region identification. Journal of Computational
    Biology, 1(1):39-50, 1994.

[SM82] R. Staden and A.D. McLachlan. Codon preference and its use in identifying
    protein coding regions in long DNA sequences. Nucleic Acids Research,
    10(1):141-156, 1982.

[SR86] Terrence J. Sejnowski and Charles Rosenberg. NETtalk: A parallel
    network that learns to read aloud. Technical Report JHU-EECS-86-01, Johns
    Hopkins University, 1986.

[Ste88] R. Sternberg. The Triarchic Mind. Viking, NY, 1988.

[TS96] S. Thrun and J. Sullivan. Discovering structure in multiple learning tasks:
    The TC algorithm. In Proceedings of the Thirteenth International
    Conference on Machine Learning (ICML-96), pages 489-497. Morgan Kaufmann,
    San Francisco, CA, 1996.

[TSB00] V. Tirunagaru, L. Sofer, and J. Burnside. An expressed sequence tag
    database of activated chicken T cells: Sequence analysis of 5000 cDNA
    clones. Genomics, 2000. In press.

[Vap95] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag,
    New York, 1995.

[Vap98] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York,
    1998.




Towards a Method of Searching a Diverse Theory Space 
for Scientific Discovery 



Joseph Phillips 

University of Pittsburgh 
Computer Science Dept. 

Pittsburgh, PA 15260, USA 
josephp@cs.pitt.edu

Abstract. Scientists need customizable tools to help them with discovery. We 
present an adjustable heuristic function for scientific discovery. This function 
may be considered in either a Minimum Message Length (MML) or a Bayesian 
Net manner. The function is approximate because the default method of 
specifying theory prior probabilities is a gross estimate and because there is 
more to theory choice than maximizing probability. We do, however, effec- 
tively capture some user preferences with our technique. We show this for the 
qualitatively different domains of geophysics and sociology. 



1 Introduction 

Our ultimate goal is to write a general program to assist scientists in creating and 
improving scientific models. Realizing this goal requires progress in machine learn- 
ing, knowledge discovery in databases, data visualization and search algorithms. It 
also requires progress in scientific model preferencing. The scientific model prefer- 
ence problem is compounded by the fact that several scientists with very similar 
background knowledge may see the same data but may prefer different models. This 
paper is the first in an ongoing study to address the scientific model preferencing issue.

Scientific discovery can be viewed as a parameter search in a large and extremely 
inhomogeneous space. Physicists, for example, prefer strong relationships between 
numeric values (e.g., equations) when they can be found. They also, however, use 
knowledge that is more conveniently expressed hierarchically in decision trees and 
semantic nets. This is exemplified by the classification of, and the assigning of 
fundamental properties to subatomic particles. 

The minimum message length (MML) criterion is a mathematically well- 
grounded approach for choosing the most probable theory given data [21] [8] [24] [5]. 
Inspired by information theory, the criterion states that the most probable model has 
the smallest encoding of both the theory and data. Ideally, the theory’s encoding 
results from a domain expert’s estimation of its prior probability and is language
independent. The encoding of the data should also be probabilistic: as a function of a 
given theory. 

Despite its generality and power for finding parameters in single classes of mod- 
els (e.g., the class of polynomials), many have expressed skepticism about whether 
MML may meaningfully be applied to finding parameters in inhomogeneous model 
spaces (e.g., general scientific discovery). Cheeseman, for example, states “although 
finding the most probable domain model is often regarded as the goal of scientific 
investigation, in general, it is not the optimal means of making predictions.” [5] 

Our immediate, limited goal is to devise a heuristic function that can help users in 
large and inhomogeneous model spaces. Ideally, a search algorithm that is informed 
with our heuristic will return several regions in the model space that contain 
promising models, some known and some novel. Our approach is to adapt MML in a 
customizable manner. 

1. We make MML applicable to a larger set of scientific discovery tasks by mapping its
terms onto those used by scientists: theory, laws and data. The MML theory is 
mapped to scientific theory. The MML data is split into scientific laws and data. 

2. We make our heuristic function adjustable, but in a principled manner, by giving 
the user only two calibration parameters. These parameters directly correspond to 
the relationship between scientific theory and law, and scientific theory and data. 
It would be nice if we could ignore differences between theories and pretend that 
there is one “best” theory for all scientists. This, however, ignores significant evi- 
dence that scientists differ in opinion, e.g., see [10][15]. 

We judge our function based on criteria for heuristic functions: generality, ease of 
computation, simplicity and smoothness. 

We do not claim that we have “solved” this problem. The feature set by which to 
judge theories and the identification of the “best” model remain unsolved problems. 

1. We offer no good guidance in developing the theory’s prior probability. Cheese- 
man and others have stressed the importance of using domain knowledge to spec- 
ify the theory’s prior probability. They have also stated that syntactic features are 
often a poor substitute. We are aware of no general algorithm for the estimation of 
a theory’s prior probability. Although our technique is not limited to syntactic fea- 
tures, we use them in this paper. Our approach is compatible with more principled 
prior probability specifying techniques. 

2. We make no claim that the “best” theory will result from this approach. This is 
due to (1) the unsolved prior probability problem, (2) to the difficulty in searching 
a large and inhomogeneous model space, and (3) the fact that the most probable 
model may or may not be the best model. 
We have developed a useful heuristic function despite these two major limitations.
Its generality is tested by analyzing its performance in two completely different 
domains: sociology and geophysics. 

This paper is organized as follows. Section 2 discusses previous approaches to 
automated scientific discovery. Section 3 briefly introduces MML. Our approach is 
detailed in section 4. Section 5 presents and discusses our experiments. Section 6 
concludes. 

2 Scientific Discovery 

Several criteria have been proposed by philosophers of science for comparing com- 
peting hypotheses [3]. Among them are accuracy/empirical support, simplicity, nov- 
elty and cost/utility. Most automated approaches consider accuracy and simplicity. 

IDS by Nordhausen and Langley was perhaps the first general program for sci- 
entific discovery [18] [19]. IDS takes as input an initial hierarchy of abstracted states 
and a sequential list of “histories” (qualitative states, see [6]). Using each history IDS 
modifies the affected nodes of the abstracted state tree to incorporate any new knowl- 
edge gained from that history. Its output is a fuller, richer hierarchy of nodes repre- 
senting history abstractions. 

Thagard introduced Processes of Induction (or PI), to propose a computational 
scheme for scientific reasoning and discovery, but not as a working discovery tool 
[23]. PI represents models as having theories, laws and data. It evaluates scientific 
models by multiplying a simplicity metric by a data coverage metric. The simplicity 
metric is a function of how many facts have been explained and of how many co- 
hypotheses were needed to help explain them. The evaluation scheme is fixed and has 
no notion of degree of inaccuracy. 

Zytkow and Zembowicz developed 49er, a general knowledge discovery tool 
[27][26]. It has a two stage process for finding regularities in databases. The first 
stage creates contingency tables (counts of how often values of one attribute co-occur 
with those of another) for pairings of database attributes. The second stage uses the 
contingency tables to constrain the search for other, higher order, regularities (e.g. 
taxonomies, equations, subset relations, etc.) 
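
The first stage described above, building contingency tables for pairs of attributes, can be sketched in a few lines of Python. This is an illustration of the idea only and not 49er's implementation; the function name, the record layout and the example values are assumptions.

```python
from collections import Counter
from itertools import combinations


def contingency_tables(records, attributes):
    """Count how often each pair of values co-occurs, for every attribute pair.

    Returns a dict mapping (attr_a, attr_b) to a Counter keyed by
    (value_of_a, value_of_b).
    """
    tables = {}
    for a, b in combinations(attributes, 2):
        tables[(a, b)] = Counter((r[a], r[b]) for r in records)
    return tables


# Hypothetical toy database of four records.
records = [
    {"rock": "basalt", "region": "ridge"},
    {"rock": "basalt", "region": "ridge"},
    {"rock": "granite", "region": "craton"},
    {"rock": "basalt", "region": "craton"},
]
tables = contingency_tables(records, ["rock", "region"])
print(tables[("rock", "region")])
```

Cells with a zero count, such as granite on a ridge in this toy data, are the kind of gap a second stage could use to narrow the search for higher-order regularities.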

Valdes-Perez has suggested searching the space of scientific models from the 
simplest to ones with increasingly more complexity, stopping at the first that fits the 
data. MECHEM uses this approach to find chemical reaction mechanisms [25]. Such 
orderings would be easy to encode as heuristic functions. 
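
A simplest-first ordering of this kind can be illustrated on a toy model space. The Python sketch below enumerates integer-coefficient polynomials by increasing degree and stops at the first one that reproduces the data exactly; it is not MECHEM, and the model space, function names and coefficient range are assumptions chosen for brevity.

```python
from itertools import product


def polynomials(max_degree, coeff_range):
    """Yield (degree, coeffs) in order of increasing degree.

    coeffs[i] is the coefficient of x**i; the leading coefficient is
    non-zero for degree >= 1 so each polynomial appears at its true degree.
    """
    for degree in range(max_degree + 1):
        for coeffs in product(coeff_range, repeat=degree + 1):
            if degree == 0 or coeffs[-1] != 0:
                yield degree, coeffs


def evaluate(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


def simplest_fitting_model(data, max_degree=3, coeff_range=range(-3, 4)):
    """Return the first (degree, coeffs) that fits every (x, y) pair, else None."""
    for degree, coeffs in polynomials(max_degree, coeff_range):
        if all(evaluate(coeffs, x) == y for x, y in data):
            return degree, coeffs
    return None


# Points generated by y = 1 + x + x**2 (hypothetical data).
print(simplest_fitting_model([(0, 1), (1, 3), (2, 7)]))
```

Because the enumeration is ordered by complexity, the search never needs a separate simplicity score: the stopping rule itself encodes the preference.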

We extend these approaches by using an adjustable, explicitly mentioned heuris- 
tic function that does not require enumerating all possible models. Our approach is to 
generalize Thagard’s scheme and place it on sounder theoretical footing.

3 Information Theory and Diverse Model Discovery

The MML criterion is to minimize the sum of the length of a theory and data given 
the theory. Some data will have a smaller combined compressed length than the orig- 
inal message. For example, the pitch and relative durations of some bird calls may be 
written in musical notation. This notation dramatically reduces the information from 
the original time-dependent air-pressure signal that the bird produced. However,
many sounds are not appropriately described by musical notation (e.g., human
speech). For these, the original time-dependent air-pressure signal will be a better
representation than musical notation.

The equation that relates these terms for data set $D$; context $c$; discrete, mutually
exclusive and exhaustive hypotheses $\{H_0, H_1, \ldots, H_n\}$ with assigned prior probabilities
$p(H_i \mid c)$; and computed conditional data probabilities $p(D \mid H_i, c)$ is:

$$-\log p(H_i \mid D, c) = -\log p(H_i \mid c) - \log p(D \mid H_i, c) + \text{const} \tag{1}$$

which is equation (2) of [5]. Recall that the -log(p(choice)) is the Shannon lower 
bound on the information needed to distinguish choice from other possibilities. The 
constant term serves to “normalize” the probabilities and may be ignored if you only 
want their relative order. Cheeseman gives this iterative process for applying MML: 

1. Define the theory space. 

2. Use domain knowledge to assign prior probabilities to the theories. 

3. Use Bayes’ theorem to obtain the posterior probabilities of the theories given the 
data from adequate descriptions of the theories (i.e., from descriptions that let you 
compute $p(D \mid H_i, c)$).

4. Search the space with an appropriate algorithm. 

5. Stop the search when a probable enough theory has been found (subject to
computational constraints), or else redefine the theory space or prior probabilities.
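
For a small, fully enumerated hypothesis set, equation (1) and steps 2 and 3 above reduce to a few lines of arithmetic. The Python sketch below ranks hypotheses by two-part message length and also normalizes the posteriors; the hypothesis names and the probabilities are made-up values for illustration, and base-2 logarithms are an assumption (the base only changes the unit).

```python
import math


def message_length(prior, likelihood):
    """Two-part length in bits: -log2 p(H|c) - log2 p(D|H,c)."""
    return -math.log2(prior) - math.log2(likelihood)


def rank_hypotheses(hypotheses):
    """Sort hypothesis names from shortest to longest message length.

    `hypotheses` maps a name to a (prior, likelihood) pair.
    """
    return sorted(hypotheses, key=lambda h: message_length(*hypotheses[h]))


def posteriors(hypotheses):
    """Bayes' theorem over an exhaustive, mutually exclusive hypothesis set."""
    joint = {h: p * l for h, (p, l) in hypotheses.items()}
    total = sum(joint.values())  # plays the role of the constant in equation (1)
    return {h: j / total for h, j in joint.items()}


# Hypothetical priors p(H|c) and likelihoods p(D|H,c).
hypotheses = {
    "simple": (0.5, 0.125),
    "complex": (0.25, 0.5),
    "null": (0.25, 0.0625),
}
print(rank_hypotheses(hypotheses))
print(posteriors(hypotheses))
```

The ranking by message length and the ranking by posterior probability agree, which is why the constant term can be dropped when only the relative order is needed.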

Several obstacles hamper efforts to apply MML to general scientific discovery. 
Among them are the specification of the initial theory prior probabilities, the inher- 
ently iterative nature of MML, and the difficulty in searching this space for a true 
“highest probability” theory. 

As in other MML efforts, there is no good rule for specifying an initial set of prior
probabilities. Although Cheeseman and others warn about using syntactic features, 
this may be the easiest approach to try in a new domain. 

MML is an inherently iterative process of redefining theory spaces and prior 
probabilities. This complicates the usage of any function that needs calibration. 
The scientific theory search space is expected to be highly irregular, hampering
the search for the “best” model. This is true of other domains. Cheeseman suggests 
simulated annealing and the EM algorithm as potential search mechanisms. 

4 Our Approach 

We do not claim to have an optimal heuristic function in terms of returning the truly 
“best” model. Rather, our goal is to create a decent heuristic function that may help 
scientists on their initial searches with large, inhomogeneous spaces. 

Good heuristics for real-world problems are often tricky to design [16]. We eval- 
uate our function based on four criteria: 

1. Generality over different sciences: We seek a function that is applicable to both
primarily conceptual models and primarily numeric ones.

2. Ease of computation: The function should not rely too heavily on values that are 
computationally difficult to obtain. And, once it has its values, it should be rapidly 
computable. 

3. Simplicity of form: There are several competing beliefs for how scientific models 
should be evaluated. The function’s design should be as transparent as possible so 
that its assumptions are readily comprehended. 

4. Smoothness: The function should give similar models similar scores. 

We chose these criteria because they are important to our long-term goal of creating a 
general program to assist a variety of scientists. 

Our contributions are the improvements in generality and ease of computation 
over Thagard’s function. Generality is improved in three ways. First, it is adjustable 
to the tastes of a particular scientist. Second, it is able to handle degrees of inaccu- 
racy. Lastly, it may use statistical arguments as well as proofs. Statistical arguments 
also improve the ease of computation: the function does not have to try to formally 
prove laws or data using perhaps an undecidable theory. The form of our function, 
however, is a little more detailed than Thagard’s. The smoothness of both of our 
approaches critically depends upon how the user designs models. 

Following Thagard, models have three components: a theory that specifies the 
details of the model, the data to predict, and a set of laws found from the data and pre- 
dicted by the theory. The theory and the law set are both composed of assertions in 
some language. We use first order predicate logic with the data structure extensions of 
Prolog as our language in this paper. The distinction between which assertions are 
theory and which are laws is given by Lakatos. He distinguishes between commonly 
accepted knowledge (the “hard core”, i.e., theory) and more tentatively held
knowledge (the “auxiliary hypotheses”, i.e., laws) of a given research program
[12] [13]. The auxiliary hypotheses are the statements that are not commonly held
(i.e., have lower prior probability), and are the main objects that are manipulated during
Kuhnian normal scientific discovery [10]. The data is assumed to be in tabular
form with associated uncertainties and error bars. 

It is simplest to assume that: 

1. all measurements are independent of each other,

2. the data influence the choice of law set, and 

3. the law set influences the choice of theory assertions. 



Figure 1 depicts these assumptions graphically as a Bayesian network. 



Fig. 1. Bayesian network underlying the relationship between data, laws and theory



We are interested in the most probable total model. We derive the following start- 
ing from the Bayesian network of figure 1. Let T denote theory, LS denote a set of 
laws, and D denote data below: 

$$p(T, D) = p(T \mid D) \cdot p(D) \tag{2}$$



Using Bayes’ rule we may re-write this as: 



$$p(T, D) = \sum_i p(T) \cdot p(LS_i \mid T) \cdot p(D \mid LS_i) \tag{3}$$

The last expression sums over all law sets and is appropriate when there may be 
disagreement over which law set is best (e.g., several scientists combining their
beliefs). However, for an individual scientist, a particular law set may appear much 
more probable than any of its competitors. In this case we may simplify the expres- 
sion to: 

$$p(T, D) = p(T) \cdot p(LS \mid T) \cdot p(D \mid LS) \tag{4}$$
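
The step from equation 3 to equation 4 can be illustrated numerically. The following is a minimal sketch; the law-set names and every probability in it are hypothetical values chosen for illustration and do not come from the experiments reported here.

```python
# Hypothetical candidate law sets. Each entry holds (p(LS_i | T), p(D | LS_i)).
# None of these numbers is taken from the paper.
law_sets = {
    "LS_1": (0.90, 0.20),
    "LS_2": (0.05, 0.10),
    "LS_3": (0.05, 0.01),
}
p_theory = 0.01  # hypothetical prior p(T)


def joint_sum(p_t, candidates):
    """Equation 3: sum p(T) * p(LS_i|T) * p(D|LS_i) over all law sets."""
    return sum(p_t * p_ls * p_d for p_ls, p_d in candidates.values())


def joint_dominant(p_t, candidates):
    """Equation 4: keep only the single most probable law set's term."""
    return max(p_t * p_ls * p_d for p_ls, p_d in candidates.values())


full = joint_sum(p_theory, law_sets)
approx = joint_dominant(p_theory, law_sets)
print(f"sum over law sets: {full:.6f}")
print(f"dominant law set:  {approx:.6f}")
print(f"fraction captured: {approx / full:.3f}")
```

When one law set carries most of the probability mass, as in these made-up numbers, the single-term form loses little compared with the full sum.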



Now we consider the meaning of each term.

The first term of equation 4 tells us the a priori probability of a theory, without
reference to the law set or data. It encodes the biases on theories. It may be used, for 
example, to prefer one type of assertion over another. A commonly mentioned bias in 
science is one for syntactic simplicity, which is often measured as the length of an 
expression in a given language. This first term is the natural place to encode such a 
bias because this common measure of simplicity is only a function of the length of the 
expression. 

$$p(T) = 2^{-s(T)}, \quad \text{equivalently} \quad s(T) = -\log_2 p(T) \tag{5}$$



The function s(T) returns a measure of the size of T in some language. The function 
p(T) uses Shannon information theory to convert from a size to a probability. 
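
A minimal sketch of this size-to-probability conversion follows. It assumes the reading p(T) = 2^(-s(T)), which is the form consistent with the information-cost view used later. The function `token_size` is a hypothetical stand-in for s(T); the actual size measure is left to the chosen language.

```python
import math


def prior_from_size(size_bits: float) -> float:
    """Convert a size in bits to a probability: p(T) = 2 ** (-s(T))."""
    return 2.0 ** (-size_bits)


def size_from_prior(p: float) -> float:
    """Inverse conversion: s(T) = -log2 p(T)."""
    return -math.log2(p)


def token_size(theory: str, bits_per_token: float = 8.0) -> float:
    """Hypothetical s(T): whitespace-separated tokens at a fixed cost each."""
    return len(theory.split()) * bits_per_token


short_theory = "a -> b"
long_theory = "c -> a , c , a -> b"
for theory in (short_theory, long_theory):
    bits = token_size(theory)
    print(f"{theory!r}: s(T) = {bits} bits, p(T) = {prior_from_size(bits):.3e}")
```

Under this sketch a syntactically longer theory always receives a lower prior, which is the simplicity bias described above.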

We admit that the syntactic length metric is crude. We welcome scientists to 
redefine p(T) as they choose based upon their own domain knowledge. In defense of 
this initial estimate of p(T) we note that syntactic metrics: (1) are easy to compute, (2) 
are well agreed upon as being relevant (if not completely correct), (3) are common to 
many or all sciences (as opposed to symmetry, for example, which enjoys larger sup- 
port among physicists than among other scientists), and, (4) would favor syntactically 
simple theories, which may be easier to comprehend. The last point is especially rele- 
vant for initial probability distributions, which may return several interesting model 
space regions that scientists must understand before determining if they warrant fur- 
ther exploration. 

The second term tells us how likely the assertions of the law set are given the the- 
ory that we have chosen. At one extreme, if all laws are logically entailed by the the- 
ory, the term is 1.0 because they must be true (given the theory as premises). It is also 
1.0 if the law set is empty because the theory is used to directly compute the data. At 
the other extreme, the term must be 0.0 if the theory contradicts any statement of the 
law set. Values in between signify that the law set may or may not follow, depending 
on specific values of free parameters in the theory. Free parameters are values that the
theory refers to that do not have definite values, but rather distributions over sets of values.
Examples include coefficients with standard deviations, and random numbers used 
during stochastic experiments. In these cases, the second term is set equal to the frac- 
tion of the free parameter space in which all of the statements of the law set are found 
to hold. For random numbers it will be more practical to estimate this value by sam- 
pling the space. Laws are limited to refer to the theoretical terms introduced in theo- 
ries. 
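
The sampling estimate of the second term can be sketched as follows. This is an illustration only: the coefficient `k`, its normal distribution, and the law `k > 1.5` are hypothetical, and drawing samples from the parameter distribution yields a probability-weighted fraction of the free parameter space.

```python
import random


def fraction_where_laws_hold(sample_params, laws, n_samples=10_000, seed=0):
    """Estimate p(LS|T) as the fraction of sampled free-parameter settings
    in which every law of the law set holds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        params = sample_params(rng)
        if all(law(params) for law in laws):
            hits += 1
    return hits / n_samples


def sample_params(rng):
    # Hypothetical free parameter: a coefficient with mean 2.0 and
    # standard deviation 0.5.
    return {"k": rng.gauss(2.0, 0.5)}


laws = [lambda p: p["k"] > 1.5]  # hypothetical law over the free parameter
estimate = fraction_where_laws_hold(sample_params, laws)
print(f"estimated p(LS|T): {estimate:.3f}")
```

The two extremes described above fall out directly: an empty law set gives 1.0, and a law that never holds gives 0.0.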

The third term measures empirical support and the degree of data coverage by 
telling us how likely the data are given the statements of the law set and theory. The
same extremes hold when all of the data are logically entailed or some of them are contradicted
by the law set or theory. Again, values in between 0.0 and 1.0 represent the
fraction of the free parameter space in which the data are observed. Statistical assertions
have an implicit free parameter that tells from which data set the statistic was
collected. For example, consider two integers, each in the set [0..9], with an average
value of 1. The implicit free parameter must denote one of three sets: (1,1), (0,2) or
(2,0).
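
The three candidate data sets in this example can be enumerated directly. The helper name `datasets_matching_mean` is introduced here for illustration only.

```python
from itertools import product


def datasets_matching_mean(values, count, mean):
    """All ordered tuples of `count` values whose arithmetic mean is `mean`."""
    return [t for t in product(values, repeat=count) if sum(t) == mean * count]


candidates = datasets_matching_mean(range(10), 2, 1)
print(candidates)  # [(0, 2), (1, 1), (2, 0)]
```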

Consider this (propositional) example. Let our theory be the assertion “a->b”,
our law be “a” and our data be two occurrences of “b.” We would pay the appropriate
(perhaps syntactic) price for the theory. The law is not derivable from the theory,
so we set its probability to p(a) (the a priori probability that free variable A
which ranges over “a” and “not(a)” actually is “a”). From our theory and law we may 
deduce our data with probability 1. If, however, we add assertions “c->a” and “c” to 
our theory then we have (perhaps) increased theory cost, but the law is now deducible 
from theory. Thus, the law has probability 1 and has no cost. 
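
The propositional example can be traced with a small forward-chaining sketch. The rule encoding and the prior value `P_A` are assumptions made for illustration. The sketch covers only the "derivable" and "not derivable" cases, not contradiction.

```python
def derivable(goal, facts, rules):
    """Forward-chain over Horn rules (body_atoms, head) until no new atom
    can be added, then report whether `goal` was reached."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(atom in known for atom in body):
                known.add(head)
                changed = True
    return goal in known


def law_probability(law, facts, rules, prior):
    """1.0 when the law follows from the theory, else its prior probability."""
    return 1.0 if derivable(law, facts, rules) else prior


P_A = 0.3  # hypothetical prior probability that free variable A is "a"

theory_rules = [(("a",), "b")]                   # theory: a -> b
print(law_probability("a", [], theory_rules, P_A))        # law costs p(a)
print(derivable("b", ["a"], theory_rules))                # data follow from theory + law

bigger_rules = [(("a",), "b"), (("c",), "a")]    # theory: a -> b, c -> a, c
print(law_probability("a", ["c"], bigger_rules, P_A))     # law is now free
```

The larger theory raises the theory cost but makes the law deducible, which is the trade-off described in the example.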

A problem with the heuristic function as given is that it has no parameters to be 
tuned to a particular scientist’s preferences. This implies that it always returns the 
same value for the same arguments. This contradicts our goal of not imposing one 
ideal form on all scientific models. 

Scientists should be able to fine tune the heuristic function, but any adjustment 
should be general enough to be applicable to all models. Further, we want the number 
of parameters to be relatively small, both because it will make the function easier to 
calibrate and because we want to guard against potential abuse by choosing a set of 
parameters that happen to make one model score well and a similar one score poorly. 
Our solution was to generalize the function in the following manner: 

$$h_{tm+}(T, LS, D) = p(T)^A \cdot p(LS \mid T)^B \cdot p(D \mid LS)^C \tag{6}$$

The “tm” signifies that the function is over total models (i.e., theory, law set and
data) and the “+” reminds us that this is a function to maximize (i.e., larger values are
better). The three parameters A, B and C allow us to independently vary the relative
weights of the a priori model probability, the law set probability and the data
probability.

Instead of maximizing probability, we may view it as minimizing information: 

$$h_{tm-}(T, LS, D) = A \cdot s(T) - B \cdot \log_2 p(LS \mid T) - C \cdot \log_2 p(D \mid LS) \tag{7}$$

The “-” subscript denotes that this function should be minimized.

Equation 7 generalizes the original MML equation 1 in two ways. First, equation 1’s
-log p(D|H.,c) has been split into two terms, one for the law set and one for the data.
Both are graded probabilistically. Second, the coefficients A, B and C act as linear
weights for the information terms. The linear weights may seem to grossly overgeneralize
equation 1, but it really depends on how they are used. This is discussed in
more detail in the next section.
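
The relationship between the maximizing and minimizing forms can be checked with a short sketch. It assumes p(T) = 2^(-s(T)). The function names mirror the “tm+” and “tm-” labels, and all numeric inputs are illustrative.

```python
import math


def h_tm_plus(p_theory, p_laws, p_data, A=1.0, B=1.0, C=1.0):
    """Weighted product of probabilities (larger is better)."""
    return (p_theory ** A) * (p_laws ** B) * (p_data ** C)


def h_tm_minus(theory_size, p_laws, p_data, A=1.0, B=1.0, C=1.0):
    """Weighted sum of information costs in bits (smaller is better)."""
    return A * theory_size - B * math.log2(p_laws) - C * math.log2(p_data)


# Illustrative values: a 10-bit theory, p(LS|T) = 0.5, p(D|LS) = 0.25.
size_bits, p_ls, p_d = 10, 0.5, 0.25
A, B, C = 1.0, 2.0, 3.0

cost = h_tm_minus(size_bits, p_ls, p_d, A, B, C)
prob = h_tm_plus(2.0 ** -size_bits, p_ls, p_d, A, B, C)
print(cost)              # 18.0
print(-math.log2(prob))  # same quantity, recovered from the product form
```

Taking the negative base-2 logarithm of the product form gives the sum form, so maximizing one is the same as minimizing the other.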

There are two advantages to this weighting approach. First, it conforms to our
notions that some sciences value theory conciseness and hard predictions more than 
others. Set the values of A and C higher in these sciences. Second, it does not allow 
arbitrary and contrived exceptions to make two similar total models score signifi- 
cantly differently. 

Although we have offered a syntactic feature-based approach to specifying a theory’s
a priori probability, we have not limited scientists to using our function. Further,
we admit that this is an iterative approach where probabilities are refined.

Revisiting our criteria we find: 

1. Generality is achieved with the adjustable weights, the use of probabilities of
laws instead of counts of “explained facts”, the use of prior distributions instead
of “co-hypotheses”, and the potential use of proofs or statistical arguments.

2. The ease of computation is limited by our proof or statistical argument method, 
not by the heuristic. 

3. Simplicity is achieved because the form is of a weighted sum with terms for the- 
ory, law and data. 

4. Smoothness is achieved because lumping all theory together, all laws together and 
all data together hampers a user’s ability to create one model that scores well and 
another very similar one that scores poorly. 

Further generalizations of h_tm+ and h_tm- may be envisioned. Each of the coefficients
A, B and C may split into several coefficients A[1..n1], B[1..n2] and C[1..n3]. These
finer-grained coefficients may be used to weigh specific aspects of the theory (e.g.
A[1] for equations, A[2] for decision trees, etc.), specific laws of the law set (e.g.
B[1] for equations, B[2] for simple logical assertions, etc.), and specific types of data
(e.g. C[1] for spatial measurements, C[2] for temporal measurements, etc.)

Using the finer-grained coefficients is justifiable in some cases, like when there 
are large differences in the precision. For example, in seismology, earthquake times 
are known with very high precision: to within a few seconds per century. Earthquake 
locations are known with less precision: to only within tens of kilometers per 40,000 
km (the Earth’s circumference). Earthquake energies are known with far less precision,
frequently only to an order of magnitude. We may want to weigh each type of
data separately, taking into consideration how much precision is given and how much
we want this data fit at the expense of other data.
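
A per-category weighting of the data term might look like the sketch below. The category names, costs in bits, and weights are all hypothetical; they only show how separate coefficients C[time], C[location] and C[energy] would enter the sum.

```python
def weighted_cost(costs_bits, weights):
    """Sum of information costs, each scaled by the weight of its category."""
    return sum(weights[category] * bits for category, bits in costs_bits.items())


# Hypothetical per-category data costs (bits) for one earthquake catalog.
data_costs = {"time": 1200.0, "location": 5400.0, "energy": 9800.0}
# Hypothetical weights: fit the precisely known times hardest.
C = {"time": 1.0, "location": 0.5, "energy": 0.25}

total = weighted_cost(data_costs, C)
print(total)  # 6350.0
```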

Parameters A, B and C from equations 6 and 7 were not subdivided to simplify 
analysis and presentation. 

5 Experiments and Discussion 

This section discusses the rough calibration of the heuristic function to models in two 
sciences. Geophysics and sociology were chosen because they cover a broad spectrum 
of acceptable scientific models. 

We do not evaluate this function by comparing its output with that of IDS, PI, 
49er, or Mechem. Which model a scientist believes in given specific data is, at least to 
some degree, subjective. Rather, we seek a method of calibrating our heuristic such 
that if it is given examples of models that users like then it can prefer similar models 
in the future. 

The heuristic function’s parameters may be calibrated for each science by analyzing
its accepted models. Although there are three parameters, we only care about
their relative values. Accordingly, we may set A to 1 and let B and C vary. Equivalently,
borrowing from physical chemistry, we can plot B/A versus C/A to create a
“phase diagram” that tells which of the various total models are preferred by the heu- 
ristic. Each phase diagram constrains the area of each scientific model. This in turn 
constrains B/A and C/A for all models. 

Comparing B/A with C/A makes the linear weights of equation 7 a conservative 
generalization of equation 1. The plots are primarily a comparison between B and C, 
and represent a value judgement on how much scientists want their uncertainty in the 
laws rather than in the data. There is no “correct” answer to this question. As we will 
see, it varies from scientist to scientist. This also strengthens our argument for an 
adjustable heuristic function. 

If a scientist prefers model X then that scientist should set the parameters to 
where X is preferred. If the scientist is strongly tempted by model Y, then the scientist 
should adjust the parameters to be in the region of X but leaning towards that of Y. 
The scientist may iteratively update the parameter values as new models are evaluated 
by both the scientist and the heuristic. 

Please recall our limited goal: to do an initial search in a large and inhomoge- 
neous space for areas that contain potentially promising models. We do not promise 
the best models. Also, this may be an iterative process where theory prior probabilities
are revised according to previous results.

The Knowledge Base and How It Predicts

The experiments were designed for a variant of the knowledge base discussed in [20]. 
The knowledge base has two lists of assertions, one for the theory and one for the 
laws. These assertions describe a standard is_a frame hierarchy of knowledge. Asser- 
tions may be frame inheritance statements, equations or Prolog-like logic sentences. A 
Prolog-like resolution engine drives inference, but dedicated code handles frame 
inheritance and equations for efficiency. 

The output of the knowledge base to a given query is either an answer, or FAILURE,
signifying that no prediction is possible. An information cost is accrued by the data
when a prediction is wrong or missing. For symbolic values this cost is the Shannon
information cost of the prior probability of the recorded answer. Thus, the default
model to try to beat is the product of the prior probabilities of each datum. For integers
and fixed and floating point values the cost is:¹

$$-\log_2(\mathrm{DistinctValDiff}(predict, record) + 1) \tag{8}$$

where DistinctValDiff() returns the number of distinct, representable values between
the predicted and recorded values in the attribute’s given precision. (For example, if
an attribute was limited to multiples of 0.1 then DistinctValDiff(0.2, 0.4) is 2.) When
predict is missing then the function is set to its highest value for that attribute.
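
The two kinds of data cost can be sketched as follows. The function names are illustrative, not the knowledge base's actual interface. The numeric function reports the magnitude log2(DistinctValDiff + 1) as a non-negative number of bits; treating it this way is an assumption about how the sign in equation 8 maps onto a cost.

```python
import math
from collections import Counter


def distinct_val_diff(predicted, recorded, precision):
    """Number of representable steps of size `precision` between two values."""
    return round(abs(predicted - recorded) / precision)


def numeric_cost_bits(predicted, recorded, precision):
    """Magnitude log2(DistinctValDiff + 1); zero for an exact prediction."""
    return math.log2(distinct_val_diff(predicted, recorded, precision) + 1)


def symbolic_prior_cost_bits(recorded, observed_values):
    """Shannon cost of a symbolic value under its empirical prior."""
    counts = Counter(observed_values)
    return -math.log2(counts[recorded] / len(observed_values))


print(distinct_val_diff(0.2, 0.4, 0.1))                   # 2
print(numeric_cost_bits(0.3, 0.3, 0.1))                   # 0.0
print(symbolic_prior_cost_bits("x", ["x", "x", "y", "z"]))  # 1.0
```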

Sociology Data 

This technique requires large amounts of calibration data. We focused on models of 
family structure because United States Census data on family structure are readily 
available [4]. 

Data are not available for specific individuals, but they are summarized in several 
tables. From these summaries the number of families with 1, 2, 3, 4, 5, and 6 or more 
“own children” may be calculated for each family type. The family types are married 
family, male-householder family, female-householder family, married subfamily, 
male-householder subfamily and female-householder subfamily. Additionally, the 
number of childless families (but not subfamilies) may be calculated. The term “own
children” means children related by birth, marriage or adoption. The U.S. Census
Bureau switched from “head of house” to “householder” to emphasize the sharing of
responsibilities prevalent in modern American families. The term “subfamily” refers
to parent(s) who live with other adult(s) who are the householder(s) (e.g., their own
parent(s)).

¹ Equation 8 corresponds to the last term of equation 7. It defines a maximal probability at the
recorded value, and exponentially decaying probability above and below that value. This
distribution may be replaced by others and is not a critical aspect of this approach.

We randomly created a database of 10,000 people in proportion to the distribution
of household types and number of children computed from the U.S. Census data.
This database underrepresents the number of children a little because the U.S. Census
data do not distinguish among families with 6 or more children. We treated such cases as
exactly 6 children. It underrepresents the number of adults more because we made no
attempt to include all cases of adults living with other adults. Our interest is only in
predicting where children live as a function of their parents. The database lists each
person, their address, and, when the person is a child, their mother and father. Children
who did not live with their father were given illegal values for their father attribute. This
was also done for the mother attribute. All attributes are symbolic.

Sociology Models 

After surveying ethnographic reports on 250 societies, Murdock came to the anticlimactic
conclusion that the form of families in all societies is of “. . . a married man and
woman with their offspring” [17]. (This is a minimal family structure because that
unit may be embedded in larger structures.)

We take this statement as the theory. We encode it in the structure of the virtual 
relations of figure 2, augmented with some extra semantics. For example, from the 
structure of the database we may deduce that all families have one address, one child- 
set, one mother, one father, that a set of children may have 0 or more children, etc. 
The additional rules allow members to inherit selected properties of their families. 
Predicate prop(frame, attribute, value) notes that property attribute of frame has 
value value. 



[Figure 2 diagram: virtual relation family with attributes address, childset, mother, father; virtual relation child with attributes childset, family]

∀(child(C) ∧ fam(F) ∧ prop(C, family, F) ∧ prop(F, A, V) → prop(C, A, V))
∀(fam(F) ∧ prop(F, mother, M) ∧ prop(F, addr, ADDR) → prop(M, addr, ADDR))

etc.

Fig. 2. Codification of Murdock’s theory

The laws operationalize the theory by making direct predictions about recorded
values. For example, assume the child database included address information. We 
may then note a correlation between a child’s address and that of their parents.

∀(child(C) ∧ mom(M) ∧ fam(F) ∧ prop(C, mom, M) ∧ prop(M, fam, F) ∧ prop(F, addr, A) → prop(C, addr, A))
∀(child(C) ∧ dad(P) ∧ fam(F) ∧ prop(C, dad, P) ∧ prop(P, fam, F) ∧ prop(F, addr, A) → prop(C, addr, A))



Fig. 3. Codification of potential Murdock laws (atoms mother, father and family have been 
abbreviated as mom, dad and fam) 
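
How such a law yields a prediction can be sketched with a toy fact base. This is an illustration, not the Prolog-like engine described above: facts are stored as a dictionary keyed by (frame, attribute), each attribute is assumed single-valued, and all names and addresses are made up.

```python
def predict_child_address(child, facts, parent_attr="mom"):
    """Follow child -> parent -> family -> address, as in the laws of
    Figure 3. Returns None (standing in for FAILURE) if any link is missing."""
    parent = facts.get((child, parent_attr))
    family = facts.get((parent, "fam")) if parent is not None else None
    return facts.get((family, "addr")) if family is not None else None


# Hypothetical prop(frame, attribute, value) facts.
facts = {
    ("kid1", "mom"): "ann",
    ("ann", "fam"): "fam1",
    ("fam1", "addr"): "12 Elm St",
    ("kid2", "dad"): "bob",
    ("bob", "fam"): "fam2",
    ("fam2", "addr"): "9 Oak Ave",
}

print(predict_child_address("kid1", facts))          # via the mother law
print(predict_child_address("kid2", facts))          # mother law alone fails
print(predict_child_address("kid2", facts, "dad"))   # father law succeeds
```

A model restricted to the mother law cannot predict the address of a child who lives only with the father, which is where the two competing models differ.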

The competing sociological model is due to Adams [1]. After examining Latin 
American and some ethnic societies, Adams concluded that the evidence for
nuclear families as described by Murdock was “marginal at best” [14]. Instead, he
proposed the mother-child dyad as the primary unit. This new model is created by 
removing the father attribute, or merely disallowing its use in proofs. We also delete 
the father law mentioned in figure 3 from the law set. 

We bound the parameters by considering two unacceptable models at opposite 
extremes. The first is the “data” model. It uses neither theory nor laws to predict val- 
ues. It merely reflects the prior probability of any one value. The second is the “the- 
ory” model. It explicitly memorizes each value individually as a statement in the 
theory. It has neither general statements nor laws, and overfits the data. 

Table 1 gives the sizes of each component of each total model. Both Murdock’s
and Adams’ models must memorize adult addresses. Adams’ must also memorize
those of children who live with their fathers but not mothers. The law sentences
in figure 3 logically follow from theory so they have size 0. Unfortunately, the zero 
size forbids the constraining of the B parameter by this experiment. 

Table 1. Sizes of sociological models

Model     Abbr   Theory     Law     Data
data      d           0       0   107637
Adams     a         240       0    79582
Murdock   m         480       0    77739
theory    t      960960       0        0
Adams’    A         240   23429    77739








Figure 4a gives the “phase diagram” plot of the data. Where a model outscores all
others its abbreviating letter appears in the parameter space. log2(C/A) is plotted on
the X axis and log2(B/A) on the Y axis.

Fig. 4. Sociology model “phase diagrams”



To place bounds on B we consider adding the father sentence to Adams’ law set. 
However, we cannot prove it from our theory. Therefore, we accept father in the 
model as a free variable with its (data-specified) prior probabilities. This results in a 
model with the equivalent predictive power of Murdock’s. It can now predict the 
addresses of children living with only their fathers. The price we pay is the Shannon
information cost of the prior probability of each usage of the
prop(Child, father, Father) predicate for these predictions. See Adams’ in Table 1.
The revised “phase diagram” with Adams’ new model is plotted in Figure 4b.
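
A text version of such a “phase diagram” can be computed from the sizes in Table 1. The sketch below assumes that the three sizes are information costs that enter the minimizing form of the heuristic linearly, with A fixed at 1. It is not claimed to reproduce the published figure exactly.

```python
# (theory, law, data) sizes as listed in Table 1.
MODELS = {
    "d": (0, 0, 107637),       # data
    "a": (240, 0, 79582),      # Adams
    "m": (480, 0, 77739),      # Murdock
    "t": (960960, 0, 0),       # theory
    "A": (240, 23429, 77739),  # Adams' (with the father law added)
}


def score(sizes, b_over_a, c_over_a):
    """Weighted information cost with A = 1 (smaller is better)."""
    theory, law, data = sizes
    return theory + b_over_a * law + c_over_a * data


def preferred(log2_c, log2_b, models=MODELS):
    """Abbreviation of the model with the lowest cost at one grid point."""
    b, c = 2.0 ** log2_b, 2.0 ** log2_c
    return min(models, key=lambda name: score(models[name], b, c))


def phase_diagram(lo=-12, hi=12, step=2):
    """Rows run from high to low log2(B/A); columns from low to high log2(C/A)."""
    rows = []
    for log2_b in range(hi, lo - 1, -step):
        rows.append("".join(preferred(log2_c, log2_b)
                            for log2_c in range(lo, hi + 1, step)))
    return "\n".join(rows)


print(phase_diagram())
```

With equal weights the Murdock model scores best; a very small C/A favors the "data" model, a very large C/A favors the memorizing "theory" model, and a small B/A lets Adams' overtake Murdock.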

Geophysics Data 

We obtained data from the United States Geological Survey’s National Earthquake 
Information Center. We retrieved all recorded earthquakes in the catalog in a rectan- 
gular box from 139E to 162E and from 41N to 55N from 1976 to 2000. The Kuril 
subduction zone, the Japanese island of Hokkaido, and the Kuril island chain are the 
most prominent geophysical features in this area. Non-tectonic events were removed 
and the remaining ones were fit to a great circle. This great circle was taken to be the 
“length” of the fault and events greater than 512 km from it were removed. The time, 
distance-along-fault, (signed) distance-from-fault and depth of the remaining 11031 
events were entered into our earthquake database. 
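
The distance filter can be sketched with the standard cross-track formula for a point and a great circle on a sphere. The fitting procedure itself is not shown here. The spherical Earth radius, the representation of the great circle by its pole, and the sample coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def unit_vector(lat_deg, lon_deg):
    """Cartesian unit vector for a latitude/longitude in degrees."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def signed_distance_from_great_circle(lat_deg, lon_deg, pole_lat_deg, pole_lon_deg):
    """Signed distance (km) from a point to the great circle whose pole
    (unit normal of its plane) is at the given latitude and longitude."""
    p = unit_vector(lat_deg, lon_deg)
    n = unit_vector(pole_lat_deg, pole_lon_deg)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, n))))
    return EARTH_RADIUS_KM * math.asin(dot)


def keep_near_fault(events, pole, max_km=512.0):
    """Keep (lat, lon) events within `max_km` of the great circle."""
    return [e for e in events
            if abs(signed_distance_from_great_circle(e[0], e[1], *pole)) <= max_km]


# Check against the equator, whose pole is the North Pole.
one_degree = signed_distance_from_great_circle(1.0, 150.0, 90.0, 0.0)
print(f"{one_degree:.2f} km")
print(keep_near_fault([(1.0, 150.0), (-10.0, 150.0), (4.0, 140.0)], (90.0, 0.0)))
```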

Geophysics Model 

In the theory of plate tectonics, a subduction zone is a region where one (oceanic) 
plate sinks beneath another (continental) plate. A Wadati-Benioff zone is the seismically
active portion of this interface [2] [23].

A Wadati-Benioff zone may be modeled as a plane that increases in depth the
further one goes into the continental plate. We did so by stating the assertions of Figure
5 in the theory, where the slope and intercept were found by a least-squares fit.

DistFromFault = slope × depth + intercept

inherit(kuril_quakes, slope, 1.05682).

inherit(kuril_quakes, intercept, -85.9936 km).

Fig. 5. The theory of the planar Wadati-Benioff zone model. 
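
An ordinary least-squares fit of a line of this form can be sketched as below. The depths are synthetic and lie exactly on the line of Figure 5, so the fit simply recovers the stated coefficients; the earthquake catalog itself is not reproduced here.

```python
def least_squares_line(xs, ys):
    """Closed-form ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic depths (km) and distances generated from the Figure 5 line.
depths = [10.0, 50.0, 100.0, 200.0, 400.0]
dists = [1.05682 * d - 85.9936 for d in depths]

slope, intercept = least_squares_line(depths, dists)
print(f"slope = {slope:.5f}, intercept = {intercept:.4f} km")
```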

The law set was left empty. As before, the “data” model did not try to predict, and 
the “theory” model overfit by memorization. The results are given in Table 2 and are 
plotted in Figure 6a. 



Table 2. Sizes of geophysical models.

Model        Abbr    Theory     Law    Data
data         d            0       0   97750
planar       P        [...]       0   63904
theory       t      1369230       0    9775
aftershock   a          618   13759   63103



The non-zero entry for the theory model’s data size is due to round-off error.
That is, there is a slight difference between the decimal recording of the values in the logical
assertions that comprise the theory (which have a fixed number of significant digits
given by the precision of the values), and the binary recording of the values in the
database.

Fig. 6. Geophysics model “phase diagrams”

To place bounds on B we add a law to the planar model. When a particular aftershock
labelling procedure is used there is an average distance of 43.5 km between
an aftershock and its mainshock. Encoding this as a law permits better predictions of
some distances. We include no theory to predict aftershocks, only an empirical procedure
for labeling them after the fact. Therefore, we let mainshock be a free variable.
The aftershock model results are given in Table 2 and in Figure 6b.

We now evaluate our heuristic with the criteria in section 4. Recall, they were (1) 
generality, (2) ease of computation, (3) simplicity of form, and (4) smoothness. The 
function is general because it was applied to symbolic sociology and numeric geo- 
physics with equal ease, and because it has been applied to a domain where predic- 
tions have varying degrees of accuracy. Its ease of computation is limited by the 
ability to predict data, prove (or argue for) laws, and know data distributions. Also, its 
weighted sum form is simple. 

The function’s “smoothness,” its ability to give similar models similar scores, is 
limited by how honest people are with the law set. When some condition is true over 
the whole parameter space one could move it from theory to laws to avoid paying the 
syntactic cost. This is against the philosophy of this approach. Also, trying to estimate 
data distributions when there is little data may be a serious problem. Distributions 
may be used as “fudge factors” to vary a model’s score on the B/A axis. However, a 
potential advantage is that it will force such assumptions to be explicitly stated. 

We do not argue for one particular ratio for C/A or B/A. Rather, we seek a 
method for calibration. That said, we note that both the geophysics and sociology models had
similar C/A bounds. Having B be too great may lead to “overfitting” the laws to the
theory and ruling out yet unknown secondary effects. For discovery it may be best to 
fix A and C and let B vary as the model becomes more refined. This is another study. 

Note that this was truly a test of scientific rediscovery. Both the sociology and 
the geophysics theories were applied to new data. Neither Adams nor Murdock was
trying to fit U.S. demographics for 1998. Benioff stated his hypothesis after examining
events from South America and the Hindu Kush, not the Kurils. (Wadati probably had
data for Honshu, not the Kurils.)

6 Conclusion 

Scientists have different opinions on what the same data entails. To ignore that is to 
ignore the history of science. We have developed a heuristic function that takes some 
of these differences into account, and may be calibrated to a particular scientist, along 
our given axes. This heuristic function is a generalization of single model family 
parameter finding MML. It generalizes MML in a principled fashion to consider how 
much faith to put in laws versus data. Our approach also extends [23] to be applied to 
scientific discovery. It is general and has been applied to both symbolic and numeric 
scientific models.

We do not claim to have solved the whole scientific model preferencing problem.
Serious limitations remain including (1) the specification of the original model prior 
probability, (2) the inhomogeneity of the search space, and (3) the fact that the “most 
probable” model is not necessarily the best one. The purpose of this heuristic is to 
help scientists identify interesting regions in the model space, i.e., models that are the 
immediate neighbors of their favorite models in the B/A-C/A plots. This is an initial 
step of an iterative process. 

Computer scientists might believe that a heuristic function could not sufficiently 
constrain search in a domain as rich as scientific discovery. However, the heuristic 
function is only part of the search algorithm. The search algorithm may employ rules 
to suggest when to apply scientific operators (e.g., [11]), or may use metalearning to 
discover which operators are best in a particular domain. Preliminary results from 
rediscovery in geophysics show that rules and metalearning may be combined or 
employed separately to significantly speed scientific discovery [20].

Abstract. In this paper, we consider unsupervised clustering as a
combinatorial optimization problem. We focus on the use of Local 
Search procedures to optimize an association coefficient whose aim is 
to construct a couple of conceptual partitions, one on the set of objects 
and the other one on the set of attribute-value pairs. We present a study 
of the variation of the function in order to decrease the complexity of 
local search and to propose stochastic local search. Performances of the 
given algorithms are tested on synthetic data sets and the real data set 
Vote taken from the UCI Irvine repository. 

Keywords: Unsupervised conceptual clustering, optimization procedure, local search.



1 Introduction 

In the early steps of knowledge discovery from large databases, structuring data
appears as a fundamental procedure which makes it possible to better understand the
data and to define groups with regard to an a priori similarity measure. This
is usually referred to as clustering in the unsupervised learning context. The data
are composed of a set of objects described by a set of attributes such that each
object has a value for every attribute. In classification/regression, we have a
target attribute which can be used to construct the groups. Knowledge discovery
can be done through the learning of rules which explain the values of the target
attribute using the other attributes. In this way, each group of objects is
associated with a set of attribute-value pairs [Rak97]. When no prior information is
available, clustering procedures can be used to discover the underlying structure
of the data. They construct a partition of the set of objects such that the most
similar objects belong to the same cluster whereas the most dissimilar ones belong to
different groups. Hence, those procedures synthesize the data into a few clusters.

One of the key points in clustering is the a priori definition of similarity.
When dealing with numerical attributes, it is usual to relate the similarity between
two objects to their distance. Clustering is then reduced to the determination
of groups minimizing the intra-cluster distance and maximizing the
inter-cluster one. For instance, in the K-MEANS algorithm [JD88,CDG+88],
Euclidean distances between representative vectors of objects are used. This can
also be extended to ordinal data and even to symbolic data, but distances become
less representative in this case. Instead, probabilistic representations are
preferred. The difference between the probability of appearance of an attribute-value
pair on the whole set of objects and its restriction to the set of objects
belonging to a particular cluster is used to guide the search for a good partition.
It is a trade-off between intra-class similarity and inter-class dissimilarity of the
objects. For example, in the COBWEB algorithm [Fis87,Fis96], the category
utility function is used as an objective function. It is a weighted averaging of the
well known GINI index without fixing the number of clusters. Other methods
like AUTOCLASS [CS96] also use Bayesian classification, modeling objects by
finite mixture distributions.

C. Robardet and F. Feschet
K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 323-335, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

Another key point in clustering is the optimization procedure. The cardinality
of the set of all possible partitions increases exponentially with the size n of
the set of objects, which leads to the use of fast but often rough heuristics. In the
K-MEANS algorithm, a heuristic based on the principle of reallocation is used.
At each step, cluster centroids are computed and each object is assigned to the
cluster whose centroid is the closest. After a few such steps the procedure stops
improving the partition. Unfortunately, the algorithm makes only local
changes to the initial partition and thus typically gets trapped in the first local
minimum. The COBWEB method uses an incremental procedure which classifies
objects one by one. For each object, the procedure evaluates the two following
options: classifying the object in one of the existing clusters or creating a new one
containing only this object. The operation which leads to the largest
increase in the function is retained. The main drawback of this heuristic is that
it often constructs a local optimum which is dependent on the order of the objects
in the incremental process. In AUTOCLASS, optimization is done for maximum a
posteriori (MAP) parameters with the EM algorithm. In fact, among a set of
models, each consisting of an a priori number of clusters and probability distribution
functions, the method consists in estimating some parameters using the EM
algorithm and choosing the best model using a MAP estimator.
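
The reallocation principle of K-MEANS described above can be sketched as follows. This is a plain textbook-style iteration written for illustration, with made-up points and starting centroids; it is not the implementation used in any of the cited systems.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment to the closest centroid and centroid update
    until the assignment stops changing (a local optimum)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, [(0, 0), (10, 10)])
print(labels)
print(centers)
```

Because each step only reassigns objects relative to the current centroids, the result depends on the starting centroids, which is the sensitivity to the initial partition noted above.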

Optimization can be global or local. The first one is usually unreachable
and the second is very sensitive to initial conditions. Popular methods like Tabu
Search or Genetic Algorithms are widely used without a clear understanding of how they
work. In this paper, we restrict ourselves to local optimization procedures and more
precisely to the simplest one, the local search procedure. Local optimization
seems to be a promising method for clustering since it has provided good results
at a low cost in many combinatorial optimization problems. We base our study
on a variational approach to an objective function which is described in section
2. Variations of the function through elementary modifications are studied in
section 3 where a single model of modification is given. This permits us to introduce
five stochastic optimization procedures which are experimentally studied
on two different data sets. The first one is an artificial data set and the second
one is the Vote data from the UCI Irvine repository. We then propose some
conclusions and future work.

2 Clustering Method

To strengthen the semantic knowledge held by partitions, we study an algorithm
for the construction of two linked partitions, one on the set of objects and the
other on the set of attribute-value pairs; we call this pair a bi-partition.
Similar methods have already been proposed. We can cite the methods of data
reorganization [MSW72,SCH75], which consist in permuting rows and columns of a
data table so as to minimize a distance. Another one is the simultaneous
clustering algorithm [Gov84]. It consists in searching for a pair of partitions
into a priori fixed numbers K and L of clusters, and for an ideal binary table
of dimensions K x L, such that the gap between the initial data table structured
by the two partitions and the ideal table is minimized. These two approaches
have important drawbacks. The first methods do not produce partitions, which
must be constructed by the user. The second one determines a pair of partitions
with a priori fixed numbers of clusters. Furthermore, the resulting pair of
partitions is often far from the global optimum.

To enforce the knowledge contribution brought by the bi-partition, we favor
pairs of partitions which satisfy the following property.

Property. The functional link, which restores one partition on the basis of the
knowledge of the second one, must be as strong as possible. Furthermore, both
partitions must have the same number of clusters.

To evaluate the quality of a bi-partition regarding this property, we construct
a function over $\mathcal{P}_O \times \mathcal{P}_Q$, where $\mathcal{P}_O$ is
the set of partitions on the set of objects, and $\mathcal{P}_Q$ is the set of
partitions on the set of attribute-value pairs. This function must satisfy some
properties [Rak97,RF01] to be adapted to the clustering structure, such as
independence from cluster permutations or the ability to treat bi-partitions
whose partitions have different numbers of clusters. These properties are
partially satisfied by association measures, which have been built to evaluate
the link between two qualitative attributes X and Y, considered as partitions of
the same set. Association measures are widely used in supervised clustering
[LdC96], whereas few unsupervised clustering algorithms use them [MH91]. We
propose [RF01] to use an adaptation of the $\tau_b$ measure constructed by
Goodman and Kruskal [GK54], which we call $\tau_Q$:

$$\tau_Q = \frac{\sum_i \sum_j \frac{p_{ij}^2}{p_{i.}} - \sum_j p_{.j}^2}{1 - \sum_j p_{.j}^2}$$

We denote by $\tau_O$ the above measure obtained when exchanging the
attributes^1. We denote by $p_{i.}$ (resp. $p_{.j}$) the frequency estimator of
the probability associated with the attribute-value pair i (resp. j) of the
attribute X (resp. Y), and by $p_{ij}$ the frequency estimator of the
probability that the attribute-value pair i of X and the attribute-value pair j
of Y occur simultaneously. The $\tau_Q$ coefficient evaluates the proportional
reduction in error given by the knowledge of the attribute X on the prediction
of Y. It takes into account the whole structure of the distribution when
estimating the variation in the prediction. Using this measure, we do not need
to fix the number of clusters in the partitions. It measures how much the
knowledge of the partition P in $\mathcal{P}_O$ improves the prediction of the
cluster of an attribute-value pair in a partition Q in $\mathcal{P}_Q$, knowing
the cluster(s) of P which contain objects described by this attribute-value
pair. The measure is normalized and consequently neither the discrete partition
nor the single-cluster partition is favored. Moreover, experiments have been
carried out by M. Olszak [Ols95] and also by us [RF01]. They consist in
comparing several association measures on different synthetic data sets. Both
studies find that $\tau_Q$ has an appropriate behavior.

^1 $\tau_O$ is used to determine an adequate partition in $\mathcal{P}_O$ and
$\tau_Q$ is used to obtain an adequate one in $\mathcal{P}_Q$.
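
To illustrate how such a coefficient is computed, the sketch below evaluates a
Goodman and Kruskal style proportional-reduction-in-error coefficient from a
contingency table of counts. It assumes the standard form of the coefficient,
with rows playing the role of X and columns the role of Y; the adaptation
actually used by the authors may differ, and the function name `gk_tau` is ours.

```python
def gk_tau(table):
    """Proportional reduction in error when predicting the column (Y)
    from the row (X). `table[i][j]` is a count, not a frequency."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # sum_i sum_j p_ij^2 / p_i.  written with counts: n_ij^2 / (n * n_i.)
    within = sum(
        v * v / (n * row_tot[i])
        for i, row in enumerate(table)
        for v in row
        if row_tot[i] > 0
    )
    # sum_j p_.j^2
    marginal = sum((c / n) ** 2 for c in col_tot)
    return (within - marginal) / (1.0 - marginal)


print(gk_tau([[5, 0], [0, 5]]))  # rows determine columns exactly
print(gk_tau([[1, 1], [1, 1]]))  # rows carry no information on columns
```

The first table gives the maximal value 1 and the second gives 0, which
illustrates the normalization mentioned in the text.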

To overcome the fact that our two partitions are not defined on the same set, we
build a co-occurrence table. In the data, each object is described by h
attributes $V_l$ such that $V_l : O \to \mathrm{dom}_l$.
$\mathcal{Q} = \bigcup_{l=1}^{h} \mathrm{dom}_l$ is the set of all
attribute-value pairs, differentiating each attribute value of the different
attributes. The co-occurrence table between a partition
$P = (P_1, \ldots, P_k)$ on the set O of objects and a partition
$Q = (Q_1, \ldots, Q_k)$ on the set $\mathcal{Q}$ is
$(n_{ij})_{1 \le i,j \le k}$ with

$$n_{ij} = \sum_{x \in P_i} \sum_{y \in Q_j} \sum_{l=1}^{h} \delta_{V_l(x),y}$$

where $\delta$ is the Kronecker^2 symbol. Consequently, we replace the previous
$p_{ij}$ (resp. $p_{i.}$) notation by $n_{ij}/n_{..}$ (resp. $n_{i.}/n_{..}$),
where $n_{i.} = \sum_j n_{ij}$ and $n_{..} = \sum_i \sum_j n_{ij}$; similarly,
$p_{.j}$ becomes $n_{.j}/n_{..}$ with $n_{.j} = \sum_i n_{ij}$.
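
The construction of the co-occurrence table can be sketched in a few lines. The
data layout is an assumption made for the example: each object is a tuple of
attribute values, an attribute-value pair is encoded as `(attribute index,
value)`, P is a list of lists of object indices and Q a list of lists of pairs.
The function name `cooccurrence_table` is hypothetical.

```python
def cooccurrence_table(data, P, Q):
    """n[i][j] = number of (object, attribute) couples such that the object
    is in cluster P[i] and its attribute-value pair is in cluster Q[j]."""
    column_of = {pair: j for j, cluster in enumerate(Q) for pair in cluster}
    table = [[0] * len(Q) for _ in P]
    for i, cluster in enumerate(P):
        for x in cluster:
            # Each object contributes one count per attribute.
            for attribute, value in enumerate(data[x]):
                table[i][column_of[(attribute, value)]] += 1
    return table


data = [("a", "x"), ("a", "x"), ("b", "y")]
P = [[0, 1], [2]]
Q = [[(0, "a"), (1, "x")], [(0, "b"), (1, "y")]]
print(cooccurrence_table(data, P, Q))  # [[4, 0], [0, 2]]
```

The grand total of the table equals the number of objects times the number of
attributes, here 3 x 2 = 6.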

To determine the best bi-partition, we search for a bi-partition which maximizes
the $\tau_O$ and $\tau_Q$ measures. The problem is now to find an adequate
optimization procedure, remembering that we face a combinatorial optimization
problem. Note that the search space $\mathcal{P}_O \times \mathcal{P}_Q$ is huge
(exponential in n):



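$$|\mathcal{P}_X| = \sum_{c=1}^{m} \frac{1}{c!} \sum_{i=1}^{c} (-1)^{c-i} \binom{c}{i}\, i^m \quad \text{with } |X| = m \text{ and } X \in \{O, \mathcal{Q}\}$$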



Consequently, exhaustive or potentially exhaustive search procedures, like
Branch and Bound, are unrealistic in terms of time efficiency. Using other
procedures, we have no guarantee that the obtained solution is a global optimum.
Choosing a local optimization method is a trade-off between computation cost
and quality of the result.
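
To give an idea of the size of the search space, the number of partitions of an
m-element set can be computed by summing Stirling numbers of the second kind.
The sketch below uses this standard combinatorial identity; the function name
`count_partitions` is ours.

```python
from math import comb, factorial


def count_partitions(m):
    """Number of partitions of a set of m elements."""
    total = 0
    for c in range(1, m + 1):
        # c! * S(m, c), where S(m, c) counts the partitions into exactly c clusters
        alternating = sum(
            (-1) ** (c - i) * comb(c, i) * i ** m for i in range(1, c + 1)
        )
        total += alternating // factorial(c)
    return total


for m in (5, 10):
    print(m, count_partitions(m))
```

There are 52 partitions of 5 elements and 115975 partitions of 10 elements, so
an exhaustive enumeration quickly becomes impractical.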



3 Local Search 

We consider general purpose methods which are based on the definition of the
neighborhood of a given partition. At each step, a new solution is chosen in
the neighborhood of the previous one, such that the algorithm converges towards
at least a local optimum. Generating several possible solutions at each step
makes it possible to direct the search towards the candidates which most improve
the function. The main difficulty is to construct a neighborhood which is
sufficiently rich and yet of tractable complexity. Recent works [FK00,GKLN00]
attempt to apply Local Search algorithms to the clustering problem. [GKLN00]
propose six operators for generating a partition starting from another. They
apply those operators first successively, and then stochastically according to
how frequently each of them improves the function. They observe that the second
algorithm is more robust than the first one. [FK00] couples the Local Search and
K-MEANS algorithms. The neighborhood function consists in randomly replacing a
cluster centroid by another object and then applying the K-MEANS procedure.
This procedure is less dependent on the initialization of the algorithm and
provides robust results. Both papers introduce randomness in the neighborhood
generation process and observe an increase in the quality of the results.

^2 $\delta_{V_l(x),y} = 1$ if $V_l(x) = y$, and $\delta_{V_l(x),y} = 0$
otherwise.

Local Search is often compared with Tabu Search, Genetic Algorithms, and
Simulated Annealing, which attempt to obtain a possibly global optimum without
visiting all possible solutions. Tabu Search consists in choosing a better
solution than the current one when it exists, and in accepting a sub-optimal
solution otherwise. A Tabu list prevents returning to a recently evaluated
candidate. The procedure can thus pass through local optima, but often at the
cost of a high computing time. Simulated Annealing relies on a stochastic
process which makes it possible to escape from local optima. Solutions which
improve the objective function are not necessarily kept. The selection process
consists in accepting solutions according to their associated probability. This
probability is higher for solutions improving the function. But the probability
is also influenced by a global parameter called temperature, which gradually
decreases to force the convergence of the algorithm to an optimum. Whereas other
methods generate a unique new solution at each step, the particularity of
Genetic Algorithms [Rud94] is to generate a set of best solutions, called a
population, at each step. The neighborhood of the population is defined using
genetic operators such as reproduction, mutation and crossover. New candidates
which surpass their parents are always maintained, which guarantees the
convergence to a good solution. [BRE91,Gol98] apply such algorithms to
clustering problems.
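
The role of the temperature in Simulated Annealing can be illustrated with a
common acceptance rule, the Metropolis criterion. The text above does not say
which rule is meant, so this choice is an assumption, as are the function name
`accept` and the convention that `gain` is the increase of the objective
function.

```python
import math
import random


def accept(gain, temperature, rng):
    """Metropolis-style acceptance for a maximization problem: improving
    moves are always accepted, degrading moves are accepted with a
    probability that shrinks as the temperature decreases."""
    if gain >= 0:
        return True
    return rng.random() < math.exp(gain / temperature)


rng = random.Random(0)
hot = sum(accept(-1.0, 10.0, rng) for _ in range(1000))
cold = sum(accept(-1.0, 0.1, rng) for _ in range(1000))
print(hot, cold)
```

At a high temperature most degrading moves are still accepted, whereas at a low
temperature almost none are, which is what forces the convergence.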



4 Variational Approach 

To use local optimization procedures we usually define operators. We then
apply them to the current solution to generate the neighborhood. After that,
we compute the measure on each member of the neighborhood and compare the
value with the one obtained on the current solution. This procedure is expensive
in both memory and computing time. In our problem, computing the measure on a
new partition might require duplicating the co-occurrence table and consequently
doubling the memory used, which is a drawback for the scalability of the method.
Furthermore, the complexity of evaluating the $\tau_Q$ measure is in
O(p x q), with p denoting the number of clusters of P and q the number of
clusters of Q. This cost is multiplied by the cardinality of the neighborhood.

To overcome those drawbacks, we propose a variational approach for evaluating
the objective function. We define three operators for generating neighboring
partitions: the transfer of one element from a cluster to another, the split of
a cluster into two, and the merging of two clusters into one. These operators
constitute a complete generating system because, whatever the current partition
is, we can reach any other one by applying a finite number of such operators.
We evaluate the variation of the $\tau_Q$ measure when modifying the current
partition by one of the three operators.

We first consider the variation of $\tau_Q$ when transferring, in the partition
Q, one attribute-value pair y from a cluster $Q_b$ to another cluster $Q_e$.
Given that each cluster of Q is linked to a column of the co-occurrence table,
the transfer of y from $Q_b$ to $Q_e$ moves a quantity $\lambda_i^y$ from the
cell on row i and column b to the one on row i and column e. Let us denote by
$n_{ij}$ the elements of the old co-occurrence table, and by $m_{ij}$ those of
the new one. The transfer of y induces the following equations between $n_{ij}$
and $m_{ij}$:

$$m_{ib} = n_{ib} - \lambda_i^y ; \quad m_{ie} = n_{ie} + \lambda_i^y ; \quad m_{ij} = n_{ij} \text{ otherwise} \qquad (1)$$



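The variation of $\tau_Q$ given by the transfer is then

$$\tau_Q^{old} - \tau_Q^{new} = \frac{\sum_i \sum_j \frac{n_{ij}^2}{n_{i.} n_{..}} - \sum_j \frac{n_{.j}^2}{n_{..}^2}}{1 - \sum_j \frac{n_{.j}^2}{n_{..}^2}} - \frac{\sum_i \sum_j \frac{m_{ij}^2}{m_{i.} m_{..}} - \sum_j \frac{m_{.j}^2}{m_{..}^2}}{1 - \sum_j \frac{m_{.j}^2}{m_{..}^2}}$$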



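Simplifying using equations (1), we obtain

$$\tau_Q^{old} - \tau_Q^{new} = \frac{1}{I} \times \frac{I \times \sum_i \frac{2\lambda_i^y}{n_{i.} n_{..}} \left[ n_{ib} - n_{ie} - \lambda_i^y \right] + C \times \frac{2\lambda^y}{n_{..}^2} \left[ n_{.e} - n_{.b} + \lambda^y \right]}{I - \frac{2\lambda^y}{n_{..}^2} \left( n_{.e} - n_{.b} + \lambda^y \right)}$$

where $\lambda^y = \sum_i \lambda_i^y$, and I and C are the following constants
with respect to b and e,

$$I = 1 - \sum_j \frac{n_{.j}^2}{n_{..}^2} \qquad C = 1 - \sum_i \sum_j \frac{n_{ij}^2}{n_{i.} n_{..}}$$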

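The transfer of several attribute-value pairs in a single move leads to the
same expression. Indeed, considering the transfer of a set S of attribute-value
pairs, we compute the $(\lambda_i^S)$ vectors as follows

$$\lambda_i^y = \sum_{x \in P_i} \sum_{l=1}^{h} \delta_{V_l(x),y} \quad \text{and} \quad \lambda_i^S = \sum_{y \in S} \lambda_i^y$$

Consequently, the $(\lambda_i^S)$ vectors are linear combinations of the
$(\lambda_i^y)$, and transferring a single attribute-value pair or a set of them
is evaluated by the same expression.

Furthermore, the merging of two clusters into a single one can be considered
as a transfer of all the attribute-value pairs of one cluster into another one,
which empties the first cluster. The computational expression is similar to that
of the transfer. When two columns b and e are merged into column e, the
quantity moved on row i is $n_{ib}$ and we have the following expression.

$$\tau_Q^{old} - \tau_Q^{new} = \frac{1}{I} \times \frac{- I \times \sum_i \frac{2 n_{ib} n_{ie}}{n_{i.} n_{..}} + C \times \frac{2}{n_{..}^2} n_{.b} n_{.e}}{I - \frac{2}{n_{..}^2} n_{.b} n_{.e}}$$

Splitting a cluster into two is also a transfer-like operation. It can be viewed
as a transfer of a set S of attribute-value pairs into a new, initially empty
cluster. When a column b is split into columns e and b, the variation of
$\tau_Q$ is:

$$\tau_Q^{old} - \tau_Q^{new} = \frac{1}{I} \times \frac{I \times \sum_i \frac{2\lambda_i^S}{n_{i.} n_{..}} \left[ n_{ib} - \lambda_i^S \right] + C \times \frac{2\lambda^S}{n_{..}^2} \left( \lambda^S - n_{.b} \right)}{I - \frac{2\lambda^S}{n_{..}^2} \left( \lambda^S - n_{.b} \right)}$$

Similar expressions are found for the $\tau_O$ measure when moving a subset of
objects from one row to another.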
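
The benefit of the variational expression can be checked numerically. The
sketch below compares the variation obtained from the closed-form expression
with the one obtained by rebuilding the table and recomputing the coefficient.
It assumes the coefficient has the form 1 - C/I, with I and C the two constants
defined above; the function names and the small table are ours. For brevity I
and C are recomputed at each call, whereas an actual implementation would cache
them between moves.

```python
def constants(table):
    """Return n, row totals, column totals and the constants I and C."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    I = 1.0 - sum((c / n) ** 2 for c in cols)
    C = 1.0 - sum(v * v / (rows[i] * n) for i, row in enumerate(table) for v in row)
    return n, rows, cols, I, C


def tau(table):
    """Coefficient computed from scratch (rows assumed non-empty)."""
    _, _, _, I, C = constants(table)
    return 1.0 - C / I


def delta(table, lam, b, e):
    """tau_old - tau_new when lam[i] is moved from cell (i, b) to cell (i, e),
    evaluated without building the new table."""
    n, rows, cols, I, C = constants(table)
    lam_total = sum(lam)
    row_term = sum(
        2.0 * lam[i] / (rows[i] * n) * (table[i][b] - table[i][e] - lam[i])
        for i in range(len(table))
    )
    col_term = 2.0 * lam_total / n ** 2 * (cols[e] - cols[b] + lam_total)
    return (I * row_term + C * col_term) / (I * (I - col_term))


def apply_transfer(table, lam, b, e):
    """Return a copy of the table after the transfer (used only for checking)."""
    new = [row[:] for row in table]
    for i, quantity in enumerate(lam):
        new[i][b] -= quantity
        new[i][e] += quantity
    return new


table = [[4, 1, 0], [0, 3, 2]]
lam = [1, 0]  # one pair, seen once in row 0, moves from column 1 to column 0
direct = tau(table) - tau(apply_transfer(table, lam, 1, 0))
print(direct, delta(table, lam, 1, 0))

merge = [row[2] for row in table]  # merging column 2 into column 1
direct = tau(table) - tau(apply_transfer(table, merge, 2, 1))
print(direct, delta(table, merge, 2, 1))
```

In both cases the two values agree up to rounding, and both are negative, which
means that the move increases the coefficient.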

Through the above expressions we show that the variation of the $\tau_Q$
measure can be evaluated using the co-occurrence table, for the evaluation of
the $n_{ij}$ parameters, and the data table, for the computation of the
$\lambda_i$ expressions. The partition itself is not taken into account for
computing the variations. Furthermore, we have shown that the three different
operators lead to a unique expression of the $\tau_Q$ variation, which we denote
by $\Delta((\lambda_i^S), b, e)$. The split and the merge are particular cases
of the transfer modification. Computing $\Delta((\lambda_i^S), b, e)$ has a
lower computational complexity than evaluating $\tau_Q^{old} - \tau_Q^{new}$
directly. In the variational approach, the evaluation of the first partition is
in O(p x q) because we need to compute the constant C. Then, when the constants
I and C are fixed, the complexity of evaluating a new partition is in
O(max(p, q)). When we need to update the constants I and C, it takes O(1) and
O(p) respectively. Consequently, we reduce the complexity from O(p x q) to
O(max(p, q)), except for the first evaluation.

Globally, the dimension of the problem is reduced. It is now expressed as a
function of the elementary vectors $(\lambda_i^y)$, with
$\lambda_i^y = \sum_{x \in P_i} \sum_{l=1}^{h} \delta_{V_l(x),y}$ for all
attribute-value pairs y. All $(\lambda_i^S)$ vectors can be generated from the
elementary vectors $(\lambda_i^y)$ as follows

$$\lambda_i^S = \sum_{y \in S} \lambda_i^y, \quad S \subseteq \mathcal{Q}$$

The problem is now to find a way to determine the $(\lambda_i^S)$ vectors which
lead to the largest increase in the measure. In the next section, we propose
five algorithms which differ in the way they choose such vectors.

5 Algorithms

Using the previous variational approach in a Local Search procedure leads to
the following deterministic algorithm

  For each cluster Pb do
    For each attribute-value pair y of Pb do
      For each cluster Pe ≠ Pb do
        Compute $\Delta((\lambda_i^y), b, e)$
      End For
      If (min $\Delta$ < 0) then
        Modify the co-occurrence table
    End For
  End For

At each step, we consider one attribute-value pair per cluster and try to
transfer it to another cluster. We modify the co-occurrence table for the
transfer with the most negative $\Delta$, that is, the one which most increases
the measure.

It is well known that randomness usually increases the performance of
deterministic algorithms. Stochastic optimization can be considered as a random
walk over the set of all partitions. If this search is guided so as to be
attracted by high values of some measure on the partitions, the probability of
visiting the partitions with the global maximum value is increased [FK00]. We
thus propose four randomized versions, depending on which For loop is
randomized in the deterministic version.



Stochastic 1 algorithm

  Randomly choose a cluster Pb
  Randomly choose y, an attribute-value pair in Pb
  For each cluster Pe ≠ Pb do
    Compute $\Delta((\lambda_i^y), b, e)$
  End For
  If (min $\Delta$ < 0) then
    Modify the co-occurrence table

Stochastic 2 algorithm

  Randomly choose a cluster Pb
  Randomly choose a subset S in Pb
  For each cluster Pe ≠ Pb do
    Compute $\Delta((\lambda_i^S), b, e)$
  End For
  If (min $\Delta$ < 0) then
    Modify the co-occurrence table

Stochastic 3 algorithm

  For each cluster Pb do
    Randomly choose y in Pb
    For each cluster Pe ≠ Pb do
      Compute $\Delta((\lambda_i^y), b, e)$
    End For
    If (min $\Delta$ < 0) then
      Modify the co-occurrence table
  End For

Stochastic 4 algorithm

  For each cluster Pb do
    Randomly choose a subset S in Pb
    For each cluster Pe ≠ Pb do
      Compute $\Delta((\lambda_i^S), b, e)$
    End For
    If (min $\Delta$ < 0) then
      Modify the co-occurrence table
  End For

These algorithms combine random and deterministic choices of the cluster and of
the attribute-value pair(s) to modify. Note that odd and even versions differ
by the choice of one attribute-value pair or of a subset of them. In the first
two algorithms, the cluster from which y or S is removed is chosen randomly,
whereas in the last two all the clusters are examined. In all those algorithms,
the best destination cluster is chosen after examining all the possible ones.
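
As an illustration, one step of the Stochastic 1 algorithm could be written as
follows. This is a sketch under several assumptions, not the authors' code:
clusters are lists of attribute-value pair identifiers, `lam_of[y]` holds the
vector of quantities that y contributes to each row of the co-occurrence table,
and a move that would leave a single non-empty cluster is rejected because the
coefficient is then undefined. All names are hypothetical.

```python
import random


def delta(table, lam, b, e):
    """tau_old - tau_new for moving lam[i] from column b to column e."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    I = 1.0 - sum((c / n) ** 2 for c in cols)
    C = 1.0 - sum(v * v / (rows[i] * n) for i, row in enumerate(table) for v in row)
    lam_total = sum(lam)
    row_term = sum(
        2.0 * lam[i] / (rows[i] * n) * (table[i][b] - table[i][e] - lam[i])
        for i in range(len(table))
    )
    col_term = 2.0 * lam_total / n ** 2 * (cols[e] - cols[b] + lam_total)
    denominator = I * (I - col_term)
    if denominator <= 1e-12:
        # Current or resulting partition has a single cluster: reject the move.
        return float("inf")
    return (I * row_term + C * col_term) / denominator


def stochastic1_step(table, clusters, lam_of, rng):
    """Pick a random cluster and a random pair in it, then move the pair to
    the best destination if this improves the measure. Returns True if the
    table was modified."""
    b = rng.choice([k for k, cluster in enumerate(clusters) if cluster])
    y = rng.choice(clusters[b])
    lam = lam_of[y]
    candidates = [
        (delta(table, lam, b, e), e) for e in range(len(clusters)) if e != b
    ]
    if not candidates:
        return False
    best_delta, best_e = min(candidates)
    if best_delta >= 0:
        return False
    for i, quantity in enumerate(lam):
        table[i][b] -= quantity
        table[i][best_e] += quantity
    clusters[b].remove(y)
    clusters[best_e].append(y)
    return True


# Pairs 0 and 1 only occur with row 0, pairs 2 and 3 only with row 1.
lam_of = {0: [3, 0], 1: [3, 0], 2: [0, 3], 3: [0, 3]}
clusters = [[0, 2], [1, 3]]  # deliberately poor starting partition
table = [[3, 3], [3, 3]]
rng = random.Random(0)
for _ in range(200):
    stochastic1_step(table, clusters, lam_of, rng)
print(clusters, table)
```

On this toy example the procedure ends with pairs 0 and 1 in one cluster and
pairs 2 and 3 in the other, so that the table becomes diagonal up to a
permutation of its columns.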

6 Experimentation 

To optimize a bi-partition, we successively execute the algorithm with $\tau_Q$
as objective function, which improves the partition Q in $\mathcal{P}_Q$, and
then apply the same algorithm with $\tau_O$ as objective function, and thus
improve the partition P in $\mathcal{P}_O$. We must underline the fact that
modifying Q (resp. P) greatly influences $\tau_Q$ (resp. $\tau_O$) and, to a
lesser extent, also influences $\tau_O$ (resp. $\tau_Q$). This explains why a
decrease in the measure can be observed on some of the following graphs.

We first apply those algorithms to a perfect synthetic data set which contains
200 objects and 30 attributes with 5 different values each. This data set is
composed of 5 blocks of homogeneous data, forming a bi-partition into 5
clusters. Starting from the discrete partition (Fig. 1 left) or from a random
partition (Fig. 1 right), we apply the five algorithms to this data set. In
Fig. 1, the value of $\tau_Q$ is plotted at each step.




Fig. 1. Perfect synthetic data set, starting from the discrete partition (left), or from a 
random one (right) 



On the synthetic data set, we observe that the deterministic and the third
stochastic procedures find the optimal partition in fewer steps than the other
procedures. This can be explained by the fact that in those procedures all
possible clusters Pb and all possible clusters Pe are evaluated, and that at
each step the best move for a given single y is chosen. The first stochastic
procedure is also really impressive. When the first partition is the discrete
one (see Fig. 1 left), it has the same behavior as the deterministic procedure.
When the first partition is constructed randomly (see Fig. 1 right), it takes
more steps to find the goal partition. The second and the fourth stochastic
procedures are the slowest. They rely on the random choice of a subset of
attribute-value pairs. When the subset is composed of dissimilar attribute-value
pairs, the procedure cannot improve the value of the measure. This explains why
those procedures perform better on the left graph, where the first partition is
the discrete one and consequently the possible subsets S are of small
cardinality.

To simulate a more realistic case, we randomly introduce some noise in the
data set (see Fig. 2, which shows the $\tau_Q$ value at each iteration with 10%
(left) and 30% (right) of random noise).






Fig. 2. Synthetic Data set with 10% noise (left) and 30% noise (right) 



The results obtained are similar to those found in the perfect case. The
convergence speeds are of the same order.

The previous graphs mask an important point: the time required for each step.
Table 1 gathers the computation times, expressed in seconds, for 10000
iterations. For information, they were obtained on a Pentium II 300 MHz with
32 MB of memory.



Table 1. Computation time (in seconds) used by the algorithms on the different
data sets

Algorithm       Perfect   10% noise   30% noise
Deterministic   204       250         275
Stochastic 1    2.66      4           4
Stochastic 2    23        31          31
Stochastic 3    39        130         130
Stochastic 4    81        110         120

The deterministic procedure is very time consuming. The first stochastic
procedure seems to be a good compromise between accuracy and time consumption.

Then we apply the algorithms to a well-known benchmark: the 1984 United
States Congressional Voting Records Database. We remove the attribute
expressing the vote. In Fig. 3 (left), the $\tau_Q$ values are plotted for each
iteration of the algorithms. Contrary to the previous experiments, the
distinction between the second and the fourth stochastic procedures on the one
hand, and the other procedures on the other hand, is less obvious. All the
procedures find almost the same partition. We can observe an unexpected decrease
in the function. This is due to the fact that an increase in $\tau_O$ leads to a
decrease in $\tau_Q$. Such a phenomenon appears rarely, and when it appears, the
algorithm quickly restores a better partition. Consequently this is not a
handicap in the optimization process. To visualize the influence of the
$\tau_Q$ optimization on the $\tau_O$ one, we plot (see Fig. 3 (right)) the
value of both functions at each iteration. On this graph we clearly observe the
compensation process in the optimization of both functions.




Fig. 3. Vote Data Set (left). Values of $\tau_O$ and $\tau_Q$ when using Stochastic 1 (right)



The quality of the obtained partition of voters can be evaluated through its
comparison with the results of the elections. This election consists in deciding
between a Democrat or Republican congress. We also obtain a partition into two
clusters. Table 2 crosses the two partitions.

Table 2. Cross table of the votes and the results obtained by algorithm 4

Our results vs Vote   Democrat   Republican   #
P1                    221        14           235
P2                    46         154          200
Σ                     267        168          435

The group P1 of the obtained partition clearly corresponds to the Democrat
one, whereas the group P2 fits the Republican population. The rate of accurate
prediction is here 86.2%, whereas about 90% accuracy appears to be STAGGER's
asymptote.
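
The accuracy figure follows directly from the counts of Table 2, as the short
computation below shows. Each cluster is labelled with its majority class, which
is our reading of how the rate is obtained; the variable names are ours.

```python
# Rows: clusters P1 and P2 found by the algorithm.
# Columns: Democrat, Republican (counts from Table 2).
cross = [[221, 14], [46, 154]]

total = sum(sum(row) for row in cross)
# Voters whose class is the majority class of their cluster.
matched = sum(max(row) for row in cross)
accuracy = matched / total
print(f"{matched}/{total} = {accuracy:.1%}")  # 375/435 = 86.2%
```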

The quality of the partition on the set of attribute-value pairs is also very
good. We denote by G1 the cluster of attribute-value pairs associated with the
group P1, which fits well the Democrat population. This set gathers all the
attribute-value pairs whose conditional probability of appearance, given that
the voter is Democrat, is higher than the one associated with the Republican
voters. The probability of being Democrat, knowing that the voter has this
attribute-value pair, is also higher than the one obtained for the Republican
voters, and this holds for all the attribute-value pairs. All attributes are of
binary type (yes/no), and for each attribute, the yes value belongs to one
cluster and the no value to the other. Consequently, we can say that the
obtained partition is ideal regarding our criteria of a good partition.

7 Conclusion 

In this article, we have presented a variational study of a function used for
guiding the search for a partition in conceptual clustering. It consists in
evaluating the variation of the function when transfer, merge or split operators
are applied to modify a partition. We showed that using this approach in an
optimization procedure decreases the computational cost. Furthermore, it
simplifies the problem by expressing the three operators as a single one.

We combine this approach with stochastic local search optimization procedures
and apply them to a synthetic data set and to the real data set Vote taken from
the UCI repository. The experiments lead us to conclude that some randomness is
needed in the local search procedure to speed up the convergence to the best
partition. But too much randomness, as when the procedure examines a random
subset of attribute-value pairs of a cluster, slows down the convergence even
more. The partitions obtained on the Vote data set are both of excellent
accuracy. The partition on the set of voters is almost the same as the one given
by the result of the election, without taking this information into account.
The partition on the set of attribute-value pairs follows exactly the
conditional probabilities of appearance of those attribute-value pairs given the
vote class.

In future work, we plan to analytically approximate the combination of
$(\lambda_i^y)$ vectors which most improves the quality of the partition. This
would reduce the number of optimization steps required to obtain an optimum.

Abstract. Research on the computational discovery of numeric equations has
focused on constructing laws from scratch, whereas work on theory revision has
emphasized qualitative knowledge. In this paper, we describe an approach to
improving scientific models that are cast as sets of equations. We review one
such model for aspects of the Earth ecosystem, then recount its application to
revising parameter values, intrinsic properties, and functional forms, in each
case achieving reduction in error on Earth science data while retaining the
communicability of the original model. After this, we consider earlier work on
computational scientific discovery and theory revision, then close with
suggestions for future research on this topic.



1 Research Goals and Motivation 

Research on computational approaches to scientific knowledge discovery has a 
long history in artificial intelligence, dating back over two decades (e.g., Lan- 
gley, 1979; Lenat, 1977). This body of work has led steadily to more powerful 
methods and, in recent years, to new discoveries deemed worth publication in 
the scientific literature, as reviewed by Langley (1998). However, despite this 
progress, mainstream work on the topic retains some important limitations. 

One drawback is that few approaches to the intelligent analysis of scientific 
data can use available knowledge about the domain to constrain search for laws 
or explanations. Moreover, although early work on computational discovery cast 
discovered knowledge in notations familiar to scientists, more recent efforts have 
not. Rather, influenced by the success of machine learning and data mining, many 
researchers have adopted formalisms developed by these fields, such as decision 
trees and Bayesian networks. A return to methods that operate on established 
scientific notations seems necessary for scientists to understand their results.

Like earlier research on computational scientific discovery, our general
approach involves defining a space of possible models stated in an established
scientific formalism, specifically sets of numeric equations, and developing
techniques to search that space. However, it differs from previous work in this
area by starting from an existing scientific model and using heuristic search to
revise the model in ways that improve its fit to observations. Although there
exists some research on theory refinement (e.g., Ourston & Mooney, 1990; Towell,
1991), it has emphasized qualitative knowledge rather than quantitative models
that relate continuous variables, which play a central role in many sciences.

In the pages that follow, we describe an approach to revising quantitative 
models of complex systems. We believe that our approach is a general one ap- 
propriate for many scientific domains, but we have focused our efforts on one 
area - certain aspects of the Earth ecosystem - for which we have a viable model, 
existing data, and domain expertise. We briefly review the domain and model 
before moving on to describe our approach to knowledge discovery and model 
revision. After this, we present some initial results that suggest our approach can 
improve substantially the model’s fit to available data. We close with a discussion 
of related discovery work and directions for future research. 

2 A Quantitative Model of the Earth Ecosystem 

Data from the latest generation of satellites, combined with readings from ground 
sources, hold great promise for testing and improving existing scientific models of 
the Earth’s biosphere. One such model, CASA, developed by Potter and Klooster 
(1997, 1998) at NASA Ames Research Center, accounts for the global produc- 
tion and absorption of biogenic trace gases in the Earth atmosphere, as well as 
predicting changes in the geographic patterns of major vegetation types (e.g., 
grasslands, forest, tundra, and desert) on the land. 

CASA predicts, with reasonable accuracy, annual global fluxes in trace gas 
production as a function of surface temperature, moisture levels, and soil prop- 
erties, together with global satellite observations of the land surface. The model 
incorporates difference equations that represent the terrestrial carbon cycle, as 
well as processes that mineralize nitrogen and control vegetation type. These 
equations describe relations among quantitative variables and lead to changes in 
the modeled outputs over time. Some processes are contingent on the values of 
discrete variables, such as soil type and vegetation, which take on different val- 
ues at different locations. CASA operates on gridded input at different levels of 
resolution, but typical usage involves grid cells that are eight kilometers square, 
which matches the resolution for satellite observations of the land surface. 

To run the CASA model, the difference equations are repeatedly applied to
each grid cell independently to produce new variable values on a daily or
monthly basis, leading to predictions about how each variable changes, at each
location, over time. Although CASA has been quite successful at modeling Earth’s
ecosystem, there remain ways in which its predictions differ from observations,
suggesting that we invoke computational discovery methods to improve its ability
to fit the data. The result would be a revised model, cast in the same notation
as the original one, that incorporates changes which are interesting to Earth
scientists and which improve our understanding of the environment.

Table 1. Variables used in the NPPc portion of the CASA ecosystem model.

NPPc is the net plant production of carbon at a site during the year.
E is the photosynthetic efficiency at a site after factoring various sources of
  stress.
T1 is a temperature stress factor (0 < T1 < 1) for cold weather.
T2 is a temperature stress factor (0 < T2 < 1), nearly Gaussian in form but
  falling off more quickly at higher temperatures.
W is a water stress factor (0.5 < W < 1) for dry regions.
Topt is the average temperature for the month at which MON-FAS-NDVI takes on
  its maximum value at a site.
Tempc is the average temperature at a site for a given month.
EET is the estimated evapotranspiration (water loss due to evaporation and
  transpiration) at a site.
PET is the potential evapotranspiration (water loss due to evaporation and
  transpiration given an unlimited water supply) at a site.
PET-TW-M is a component of potential evapotranspiration that takes into account
  the latitude, time of year, and days in the month.
A is a polynomial function of the annual heat index at a site.
AHI is the annual heat index for a given site.
MON-FAS-NDVI is the relative vegetation greenness for a given month as measured
  from space.
IPAR is the energy from the sun that is intercepted by vegetation after
  factoring in time of year and days in the month.
FPAR-FAS is the fraction of energy intercepted from the sun that is absorbed
  photosynthetically after factoring in vegetation type.
MONTHLY-SOLAR is the average solar irradiance for a given month at a site.
SOL-CONVER is 0.0864 times the number of days in each month.
UMD-VEG is the type of ground cover (vegetation) at a site.

Because the overall CASA model is quite complex, involving many variables 
and equations, we decided to focus on one portion that lies on the model’s 
‘fringes’ and that does not involve any difference equations. Table 1 describes the 
variables that occur in this submodel, in which the dependent variable, NPPc, 
represents the net production of carbon. As Table 2 indicates, the model predicts 
this quantity as the product of two unobservable variables, the photosynthetic 
efficiency, E, at a site and the solar energy intercepted, IPAR, at that site. 

Photosynthetic efficiency is in turn calculated as the product of the maximum 
efficiency (0.56) and three stress factors that reduce this efficiency. One stress 
term, T2, takes into account the difference between the optimum temperature, 
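Topt, and actual temperature, Tempc, for a site. A second factor, T1, involves

Table 2. Equations used in the NPPc portion of the CASA ecosystem model.

$$\text{NPPc} = \sum_{\text{month}} \max(E \cdot \text{IPAR},\ 0)$$
$$E = 0.56 \cdot T1 \cdot T2 \cdot W$$
$$T1 = 0.8 + 0.02 \cdot \text{Topt} - 0.0005 \cdot \text{Topt}^{2}$$
$$T2 = 1.18 / [(1 + e^{0.2 (\text{Topt} - \text{Tempc} - 10)}) (1 + e^{0.3 (\text{Tempc} - \text{Topt} - 10)})]$$
$$W = 0.5 + 0.5 \cdot \text{EET} / \text{PET}$$
$$\text{PET} = 1.6 \cdot (10 \cdot \text{Tempc} / \text{AHI})^{A} \cdot \text{PET-TW-M} \quad \text{if Tempc} > 0$$
$$\text{PET} = 0 \quad \text{if Tempc} \le 0$$
$$A = 0.000000675 \cdot \text{AHI}^{3} - 0.0000771 \cdot \text{AHI}^{2} + 0.01792 \cdot \text{AHI} + 0.49239$$
$$\text{IPAR} = 0.5 \cdot \text{FPAR-FAS} \cdot \text{MONTHLY-SOLAR} \cdot \text{SOL-CONVER}$$
$$\text{FPAR-FAS} = \min((\text{SR-FAS} - 1.08) / \text{SRDIFF}(\text{UMD-VEG}),\ 0.95)$$
$$\text{SR-FAS} = -(\text{MON-FAS-NDVI} + 1000) / (\text{MON-FAS-NDVI} - 1000)$$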



the nearness of Topt to a global optimum for all sites, reflecting the intuition 
that plants which are better adapted to harsh temperatures are less efficient 
overall. The third term, W, represents stress that results from lack of moisture as 
reflected by EET, the estimated water loss due to evaporation and transpiration, 
and PET, the water loss due to these processes given an unlimited water supply. 
In turn, PET is defined in terms of the annual heat index, AHI, for a site, and 
PET-TW-M, another component of potential evapotranspiration. 

The energy intercepted from the sun, IPAR, is computed as the product 
of FPAR-FAS, the fraction of energy absorbed photosynthetically for a given 
vegetation type, MONTHLY-SOLAR, the average radiation for a given month, 
and SOL-CONVER, the number of days in that month. FPAR-FAS is a function 
of MON-FAS-NDVI, which indicates relative greenness at a site as observed from 
space, and SRDIFF, an intrinsic property that takes on different numeric values 
for different vegetation types as specified by the discrete variable UMD-VEG. 

Of the variables we have mentioned, NPPc, Tempc, MONTHLY-SOLAR, 
SOL-CONVER, MON-FAS-NDVI, and UMD-VEG are observable. Three addi- 
tional terms - EET, PET-TW-M, and AHI - are defined elsewhere in the model, 
but we assume their definitions are correct and thus we can treat them as observ- 
ables. The remaining variables are unobservable and must be computed from the 
others using their definitions. This portion of the model also contains a number 
of numeric parameters, as shown in the equations in Table 2. 
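
The equations in Table 2 can be written down directly as code. The following Python sketch is an illustration only, not the CASA implementation: the function names and dictionary keys are hypothetical, NPPc is read as the sum over months of max(E · IPAR, 0), the first exponential in T2 is taken from the handcrafted expression quoted in Section 4.1, the SRDIFF lookup uses the original values listed in Table 3, and the fallback W = 0.5 when PET is zero is an assumption made here only to avoid division by zero.

```python
import math

# SRDIFF per vegetation type: the "original" row of Table 3.
SRDIFF = {"A": 3.06, "B": 4.35, "C": 4.35, "D": 4.05, "E": 5.09, "F": 3.06,
          "G": 4.05, "H": 4.05, "I": 4.05, "J": 5.09, "K": 4.05}

def t1(topt):
    """Temperature stress for cold weather."""
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2

def t2(topt, tempc):
    """Temperature stress from the gap between Topt and Tempc."""
    return 1.18 / ((1 + math.exp(0.2 * (topt - tempc - 10)))
                   * (1 + math.exp(0.3 * (tempc - topt - 10))))

def pet(tempc, ahi, pet_tw_m):
    """Potential evapotranspiration; zero when Tempc is not positive."""
    if tempc <= 0:
        return 0.0
    a = 0.000000675 * ahi ** 3 - 0.0000771 * ahi ** 2 + 0.01792 * ahi + 0.49239
    return 1.6 * (10 * tempc / ahi) ** a * pet_tw_m

def fpar_fas(ndvi, veg):
    """Fraction of intercepted energy absorbed, capped at 0.95."""
    sr_fas = -(ndvi + 1000) / (ndvi - 1000)
    return min((sr_fas - 1.08) / SRDIFF[veg], 0.95)

def nppc(months, topt, ahi, veg):
    """Sum of monthly max(E * IPAR, 0) for one site."""
    total = 0.0
    for m in months:
        p = pet(m["tempc"], ahi, m["pet_tw_m"])
        # Assumption (not in the source): W falls back to 0.5 when PET is 0.
        w = 0.5 + 0.5 * m["eet"] / p if p > 0 else 0.5
        e = 0.56 * t1(topt) * t2(topt, m["tempc"]) * w
        ipar = 0.5 * fpar_fas(m["ndvi"], veg) * m["solar"] * m["sol_conver"]
        total += max(e * ipar, 0.0)
    return total

# Made-up observations for a single month at a hypothetical site.
july = {"tempc": 20.0, "eet": 50.0, "pet_tw_m": 30.0,
        "ndvi": 500.0, "solar": 200.0, "sol_conver": 0.0864 * 31}
print(nppc([july], topt=20.0, ahi=50.0, veg="A"))
```

Keeping each equation as its own function mirrors the way the revision methods of Section 3 target one equation or intrinsic property at a time.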

3 An Approach to Quantitative Model Revision 

As noted earlier, our approach to scientific discovery involves refining models 
like CASA that involve relations among quantitative variables. We adopt the 
traditional view of discovery as heuristic search through a space of models, with 
the search process directed by candidates’ ability to fit the data. However, we 
assume this process starts not from scratch, but rather with an existing model, 
and the search operators involve making changes to this model, rather than 
constructing entirely new structures. 

Our long-term goal is not to automate the revision process, but instead to 
provide an interactive tool that scientists can direct and use to aid their model 
development. As a result, the approach we describe in this section addresses 
the task of making local changes to a model rather than carrying out global 
optimization, as assumed by Chown and Dietterich (2000). Thus, our software 
takes as input not only observations about measurable variables and an existing 
model stated as equations, but also information about which portion of the 
model should be altered. The output is a revised model that fits the observed 
data better than the initial one. 

Below we review two discovery algorithms that we utilize to improve the 
specified part of a model, then describe three distinct types of revision they 
support. We consider these in order of increasing complexity, starting with simple 
changes to parameter values, moving on to revisions in the values of intrinsic 
properties, and ending with changes in an equation’s functional form. 

3.1 The RF5 and RF6 Discovery Algorithms 

Our approach relies on RF5 and RF6, two algorithms for discovering numeric 
equations described by Saito and Nakano (1997, 2000). Given data for some continuous 
variable $y$ that is dependent on continuous predictive variables $x_1, \ldots, x_K$, 
the RF5 system searches for multivariate polynomial equations of the form 

$$y = w_0 + \sum_{j=1}^{J} w_j \prod_{k=1}^{K} x_k^{w_{jk}} = w_0 + \sum_{j=1}^{J} w_j \exp\left( \sum_{k=1}^{K} w_{jk} \ln(x_k) \right) . \qquad (1)$$

Such functional relations subsume many of the numeric laws found by previous 
computational discovery systems like Bacon (Langley, 1979) and Fahrenheit 
(Zytkow, Zhu, & Hussam, 1990). 

RF5’s first step involves transforming a candidate functional form with J 
summed terms into a three-layer neural network based on the rightmost form 
of expression (1), in which the J hidden nodes in this network correspond to 
product units (Durbin & Rumelhart, 1989). The system then carries out search 
through the weight space using the BPQ algorithm, a second-order learning tech- 
nique that calculates both the descent direction and the step size automatically. 

This process halts when it finds a set of weights that minimize the squared 
error on the dependent variable y. RF5 runs the BPQ method on networks with 
different numbers of hidden units, then selects the one that gives the best score 
on an MDL metric. Finally, the program transforms the resulting network into 
a polynomial equation, with weights on hidden units becoming exponents and 
other weights becoming coefficients. 
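
The equivalence between the two forms in expression (1) is what lets RF5 move between a polynomial and a network. The following Python sketch is not the RF5 code; it only evaluates both forms for given weights (function names and the sample weights are made up) to show that hidden-unit weights act as exponents and output weights as coefficients. It assumes all inputs are positive so that the logarithm is defined.

```python
import math

def polynomial_form(x, w0, w, exponents):
    """y = w0 + sum_j w_j * prod_k x_k ** w_jk."""
    return w0 + sum(
        wj * math.prod(xk ** wjk for xk, wjk in zip(x, row))
        for wj, row in zip(w, exponents)
    )

def network_form(x, w0, w, exponents):
    """Same function as a three-layer network with product units."""
    log_x = [math.log(xk) for xk in x]            # inputs must be positive
    hidden = [math.exp(sum(wjk * lk for wjk, lk in zip(row, log_x)))
              for row in exponents]               # one hidden unit per term j
    return w0 + sum(wj * hj for wj, hj in zip(w, hidden))

# Made-up example with J = 2 terms and K = 2 variables:
# y = 1 + 0.5 * x1^2 * x2 - 2 * x1^0.5 * x2^-1
x = [2.0, 3.0]
w0, w = 1.0, [0.5, -2.0]
exponents = [[2.0, 1.0], [0.5, -1.0]]
print(polynomial_form(x, w0, w, exponents), network_form(x, w0, w, exponents))
```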

The RF6 algorithm extends RF5 by adding the ability to find conditions on 
a numeric equation that involve nominal variables, which it encodes using one 
input variable for each nominal value. To this end, the system first generates one 
such condition for each training case, then utilizes k-means clustering to generate 
a smaller set of more general conditions, with the number of clusters determined 
through cross validation. Finally, RF6 invokes decision-tree induction to con- 
struct a classifier that discriminates among these clusters, which it transforms 
into rules that form the nominal conditions on the polynomial equation that 
RF5 has generated. 

3.2 Three Types of Model Refinement 

There exist three natural types of refinement within the class of models, like 
CASA, that are stated as sets of equations that refer to unobservable variables. 
These include revising the parameter values in equations, altering the values for 
an intrinsic property, and changing the functional form of an existing equation. 

Improving the parameters for an equation is the most straightforward pro- 
cess. The NPPc portion of CASA contains some parameterized equations that 
our Earth science team members believe are reliable, like that for computing the 
variable A from AHI, the annual heat index. However, it also includes equations 
with parameters about which there is less certainty, like the expression that predicts 
the temperature stress factor T2 from Tempc and Topt. Our approach to 
revising such parameters relies on creating a specialized neural network that en- 
codes the equation’s functional form using ideas from RF5, but also including a 
term for the unchanged portion of the model. We then run the BPQ algorithm to 
find revised parameter values, initializing weights based on those in the model. 

We can utilize a similar scheme to improve the values for an intrinsic property 
like SRDIFF that the model associates with the discrete values for some nominal 
variable like UMD-VEG (vegetation type). We encode each nominal term as a 
set of dummy variables, one for each discrete value, making the dummy variable 
equal to one if the discrete value occurs and zero otherwise. We introduce one 
hidden unit for the intrinsic property, with links from each of the dummy vari- 
ables and with weights that correspond to the intrinsic values associated with 
each discrete value. To revise these weights, we create a neural network that in- 
corporates the intrinsic values but also includes a term for the unchanging parts 
of the model. We can then run BPQ to revise the weights that correspond to 
intrinsic values, again initializing them to those in the initial model. 

Altering the form of an existing equation requires somewhat more effort, but 
maps more directly onto previous work in equation discovery. In this case, the 
details depend on the specific functional form that we provide, but because we 
have available the RF5 and RF6 algorithms, the approach supports any of the 
forms that they can discover or specializations of them. Again, having identified 
a particular equation that we want to improve, we create a neural network 
that encodes the desired form, then invoke the BPQ algorithm to determine 
its parametric values, in this case initializing the network weights randomly. 

This approach to model refinement supports changes to only one equation or 
intrinsic property at a time, but this is consistent with the interactive process 
described earlier. We envision the scientist identifying a portion of the model 
that he thinks could be better, running one of the three revision methods to 
improve its fit to the data, and repeating this process until he is satisfied. 

4 Initial Results on Ecosystem Data 

In order to evaluate our approach to scientific model revision, we utilized data 
relevant to the NPPc model available to the Earth science members of our team. 
These data consisted of observations from 303 distinct sites with known vegeta- 
tion type and for which measurements of Tempc, MON-FAS-NDVI, MONTHLY- 
SOLAR, SOL-CONVER, and UMD-VEG were available for each month during 
the year. In addition, other portions of CASA were able to compute values for the 
variables AHI, EET, and PET-TW-M. The resulting 303 training cases seemed 
sufficient for initial tests of our revision methods, so we used them to drive a 
variety of changes to the handcrafted model of carbon production. 



4.1 Results on Parameter Revision 



Our Earth science team members identified the equation for T2, one of the 
temperature stress variables, as a likely candidate for revision. As noted earlier, 
the handcrafted expression for this term was 



$$T2 = 1.18 / [(1 + e^{0.2 (\text{Topt} - \text{Tempc} - 10)}) (1 + e^{0.3 (\text{Tempc} - \text{Topt} - 10)})] ,$$

which produces a Gaussian-like curve that is slightly asymmetrical. This reflects 
the intuition that photosynthetic efficiency will decrease when temperature 
(Tempc) is either below or above the optimal (Topt). 

To improve upon this equation, we defined $x = \text{Topt} - \text{Tempc}$ as an intermediate 
variable and recast the expression for T2 as the product of two sigmoidal 
functions of the form $\sigma(a) = 1/(1 + \exp(-a))$ and a parameter. We transformed 
these into a neural network and used BPQ to minimize the error function 

$$\mathcal{F}_1 = \sum_{\text{sample}} \Big( \text{NPPc} - \sum_{\text{month}} w_0 \cdot \sigma(v_{10} + v_{11} \cdot x) \cdot \sigma(v_{20} - v_{21} \cdot x) \cdot \text{Rest} \Big)^{2} ,$$

over the parameters $\{w_0, v_{10}, v_{11}, v_{20}, v_{21}\}$, where $\text{Rest} = 0.56 \cdot T1 \cdot W \cdot \text{IPAR}$. 
The resulting equation generated in this manner was 



$$T2 = 1.80 / [(1 + e^{0.05 (\text{Topt} - \text{Tempc} - 10.8)}) (1 + e^{0.03 (\text{Tempc} - \text{Topt} - 90.33)})] ,$$



which has reasonably similar values to the original ones for some parameters but 
quite different values for others. 

The root mean squared error (RMSE) for the original model on the available 
data was 467.910. In contrast, the error for the revised model was 457.757 on 
the training data and 461.466 using leave-one-out cross validation. Thus, RF6’s 
modification of parameters in the T2 equation produced slightly more than one 
percent reduction in overall model error, which is somewhat disappointing. 

However, inspection of the resulting curves reveals a more interesting picture. 
Plotting the temperature stress factor T2 using the revised equations as a function 
of the difference Topt - Tempc still gives a Gaussian-like curve, but within 
the effective range (from -30 to 30 Celsius) its values decrease monotonically. 
This seems counterintuitive but interesting from an Earth science perspective, 
as it suggests this stress factor has little influence on NPPc. Moreover, the original 
equation for T2 was not well grounded in first principles of plant physiology, 
making empirical improvements of this sort beneficial to the modeling enterprise. 

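
The recasting of T2 as a scaled product of two sigmoids, and the monotonic behavior of the revised curve, can be checked numerically. The sketch below is illustrative rather than the authors' code: it assumes the exponent signs given in Table 2, uses x = Topt - Tempc, and maps each equation by hand onto the weights (w0, v10, v11, v20, v21) of the error function; the function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def t2_handcrafted(x):
    """Original T2 as a function of x = Topt - Tempc (signs as in Table 2)."""
    return 1.18 / ((1 + math.exp(0.2 * (x - 10))) * (1 + math.exp(0.3 * (-x - 10))))

def t2_sigmoid_form(x, w0, v10, v11, v20, v21):
    """w0 * sigma(v10 + v11 * x) * sigma(v20 - v21 * x)."""
    return w0 * sigmoid(v10 + v11 * x) * sigmoid(v20 - v21 * x)

# Weights equivalent to the handcrafted equation:
#   1 / (1 + e^{0.3(-x - 10)}) = sigma(3 + 0.3 x)
#   1 / (1 + e^{0.2(x - 10)})  = sigma(2 - 0.2 x)
ORIGINAL = (1.18, 3.0, 0.3, 2.0, 0.2)

# Weights equivalent to the revised equation:
#   1 / (1 + e^{0.03(-x - 90.33)}) = sigma(0.03 * 90.33 + 0.03 x)
#   1 / (1 + e^{0.05(x - 10.8)})   = sigma(0.05 * 10.8 - 0.05 x)
REVISED = (1.80, 0.03 * 90.33, 0.03, 0.05 * 10.8, 0.05)

xs = list(range(-30, 31))
original_curve = [t2_sigmoid_form(x, *ORIGINAL) for x in xs]
revised_curve = [t2_sigmoid_form(x, *REVISED) for x in xs]

# The original curve peaks inside the range; the revised one only falls.
print(max(original_curve) > original_curve[0],
      all(a > b for a, b in zip(revised_curve, revised_curve[1:])))
```

Initializing the network with the ORIGINAL weights corresponds to starting the search from the existing model, as described in Section 3.2.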
As another candidate for parameter revision, we selected the PET equation, 

$$\text{PET} = 1.6 \cdot (10 \cdot \max(\text{Tempc}, 0) / \text{AHI})^{A} \cdot \text{PET-TW-M} ,$$

which calculates potential water loss due to evaporation and transpiration given 
an unlimited water supply. By transforming this expression into 

$$\text{PET} = \exp(\ln(1.6) + A \cdot \ln(10)) \cdot (\max(\text{Tempc}, 0) / \text{AHI})^{A} \cdot \text{PET-TW-M}$$

and replacing the parameter values ln(1.6) and ln(10) with the variables $v_0$ and 
$v_1$, we constructed a neural network and used BPQ for error minimization. When 
transforming the trained network back into the original form, the equation that 
resulted was 

$$\text{PET} = 1.56 \cdot (9.16 \cdot \max(\text{Tempc}, 0) / \text{AHI})^{A} \cdot \text{PET-TW-M} ,$$

which has values that are very similar to those in the original model’s equation. 
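
The log transformation of the PET equation, and the mapping from trained weights back to the original form, can be sketched as follows. This is an illustrative check rather than the system's code; the function names and the sample inputs are hypothetical, and the exponent A is passed in as a plain number.

```python
import math

def pet_original_form(tempc, ahi, a, pet_tw_m, coef=1.6, base=10.0):
    """PET = coef * (base * max(Tempc, 0) / AHI)^A * PET-TW-M."""
    return coef * (base * max(tempc, 0.0) / ahi) ** a * pet_tw_m

def pet_network_form(tempc, ahi, a, pet_tw_m, v0, v1):
    """PET = exp(v0 + A * v1) * (max(Tempc, 0) / AHI)^A * PET-TW-M."""
    return math.exp(v0 + a * v1) * (max(tempc, 0.0) / ahi) ** a * pet_tw_m

def to_original_form(v0, v1):
    """Map trained weights back to the coefficient and the inner factor."""
    return math.exp(v0), math.exp(v1)

# Initial weights reproduce the handcrafted equation exactly.
v0, v1 = math.log(1.6), math.log(10.0)
print(pet_original_form(20.0, 50.0, 1.28, 30.0),
      pet_network_form(20.0, 50.0, 1.28, 30.0, v0, v1))

# Weights matching the revised equation map back to 1.56 and 9.16.
print(to_original_form(math.log(1.56), math.log(9.16)))
```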

Moreover, since the RMSE for the obtained model was 464.358 on the train- 
ing data and 467.643 using leave-one-out cross validation, the revision process 
did not improve the model’s accuracy substantially. However, since the PET 
equation is based on Thornthwaite’s (1948) method, which has been used con- 
tinuously for over 50 years, we should not be overly surprised at this negative 
result. Indeed, we are encouraged by the fact that our approach did not revise 
parameters that have stood the test of time in Earth science. 

4.2 Results on Intrinsic Value Revision 

Another portion of the NPPc model that held potential for revision concerns 
the intrinsic property SRDIFF associated with the vegetation type UMD-VEG. 
For each site, the latter variable takes on one of 11 nominal values, such as 
grasslands, forest, tundra, and desert, each with an associated numeric value for 
SRDIFF that plays a role in the FPAR-FAS equation. This gives 11 parameters 
to revise, which seems manageable given the number of observations available. 

As outlined earlier, to revise these intrinsic values, we introduced one dummy 
variable, $\text{UMD-VEG}_k$, for each vegetation type such that $\text{UMD-VEG}_k = 1$ if 
UMD-VEG = k and 0 otherwise. We then defined SRDIFF(UMD-VEG) as 
$\exp(-\sum_k v_k \cdot \text{UMD-VEG}_k)$ and, since SRDIFF’s value is independent of the 
month, we used BPQ to minimize, over the weights $\{v_k\}$, the error function 

$$\mathcal{F}_2 = \sum_{\text{site}} \Big( \text{NPPc} - \exp\Big( \sum_k v_k \cdot \text{UMD-VEG}_k \Big) \cdot \text{Rest} \Big)^{2} ,$$

where $\text{Rest} = \sum_{\text{month}} E \cdot 0.5 \cdot (\text{SR-FAS} - 1.08) \cdot \text{MONTHLY-SOLAR} \cdot \text{SOL-CONVER}$. 
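
The dummy-variable encoding can be made concrete with a short sketch. The code below is illustrative only (names are hypothetical): it one-hot encodes the vegetation type, defines SRDIFF as exp(-sum_k v_k * UMD-VEG_k), and initializes each weight to -ln of the original SRDIFF value from Table 3, so that the network starts from the existing model.

```python
import math

VEG_TYPES = "ABCDEFGHIJK"
ORIGINAL_SRDIFF = [3.06, 4.35, 4.35, 4.05, 5.09, 3.06,
                   4.05, 4.05, 4.05, 5.09, 4.05]

def one_hot(veg):
    """Dummy variables UMD-VEG_k: 1 for the observed type, 0 otherwise."""
    return [1.0 if veg == k else 0.0 for k in VEG_TYPES]

def srdiff(veg, v):
    """SRDIFF(UMD-VEG) = exp(-sum_k v_k * UMD-VEG_k)."""
    return math.exp(-sum(vk * dk for vk, dk in zip(v, one_hot(veg))))

def predicted_nppc(veg, v, rest):
    """Model prediction exp(sum_k v_k * UMD-VEG_k) * Rest, i.e. Rest / SRDIFF."""
    return math.exp(sum(vk * dk for vk, dk in zip(v, one_hot(veg)))) * rest

# Starting weights that reproduce the original intrinsic values.
v_init = [-math.log(s) for s in ORIGINAL_SRDIFF]
print(srdiff("E", v_init))
```

Because exactly one dummy variable is nonzero per site, each weight affects only the sites of its own vegetation type, which is why the 11 values can be revised jointly.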

Table 3 shows the initial values for this intrinsic property, as set by the CASA 
developers, along with the revised values produced by the above approach when 

Table 3. Original and revised values for the SRDIFF intrinsic property, along with 
the frequency for each vegetation type. 

vegetation type      A      B      C      D      E      F      G      H      I      J      K
original          3.06   4.35   4.35   4.05   5.09   3.06   4.05   4.05   4.05   5.09   4.05
revised           2.57   4.77   2.20   3.99   3.70   3.46   2.34   0.34   2.72   3.46   1.60
clustered         2.42   3.75   2.42   3.75   3.75   3.75   2.42   0.34   2.42   3.75   2.42
frequency          3.3    8.9    0.3    3.6   21.1   19.1   15.2    3.3   19.1    2.3    3.6

we fixed other parts of the NPPc model. The most striking result is that the 
revised intrinsic values are nearly always lower than the initial values. The RMSE 
for the original model was 467.910, whereas the error using the revised values 
was 432.410 on the training set and 448.376 using cross validation. The latter 
constitutes an error reduction of over four percent, which seems substantial. 

However, since the original 11 intrinsic values were grouped into only four 
distinct values, we applied RF6’s clustering procedure over the trained neural 
network to group the revised values in the same manner. We examined the effect 
on error rate as we varied the number of clusters from one to five; as expected, 
the training RMSE decreased monotonically, but the cross-validation RMSE was 
minimized for three clusters of values. The estimated error for this revised model 
is slightly better than for the one with 11 distinct values. 
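
To illustrate how the 11 revised values fall into three groups, the sketch below clusters the "revised" row of Table 3 in one dimension. It is a simplified stand-in, not RF6's clustering procedure, which operates on the trained network and selects the number of clusters by cross validation: here the number of clusters is fixed at three and the partition is found by brute force over contiguous splits of the sorted values. It recovers only the grouping, not the clustered values reported in the table.

```python
from itertools import combinations

VEG_TYPES = "ABCDEFGHIJK"
REVISED = [2.57, 4.77, 2.20, 3.99, 3.70, 3.46, 2.34, 0.34, 2.72, 3.46, 1.60]

def sse(segment):
    """Sum of squared deviations from the segment mean."""
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)

def cluster_1d(values, k):
    """Label each value with one of k clusters, minimizing total SSE."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    best_cost, best_bounds = None, None
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0,) + cuts + (len(xs),)
        cost = sum(sse(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        if best_cost is None or cost < best_cost:
            best_cost, best_bounds = cost, bounds
    labels = [0] * len(values)
    for c, (a, b) in enumerate(zip(best_bounds, best_bounds[1:])):
        for i in order[a:b]:
            labels[i] = c          # cluster 0 holds the smallest values
    return labels

labels = cluster_1d(REVISED, 3)
print(dict(zip(VEG_TYPES, labels)))
```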

Again, the clustered values are nearly always lower than the initial ones, a 
result that is certainly interesting from an Earth science viewpoint. We suspect 
that measurements of NPPc and related variables from a wider range of sites 
would produce intrinsic values closer to those in the original model. However, 
such a test must await additional observations and, for now, empirical fit to the 
available data should outweigh the theoretical basis for the initial settings. 

In another approach to revising intrinsic values, we retained the original 
grouping of vegetation types into sets, with each type in a given set having the 
same value. We utilized a weight-sharing technique to encode this background 
knowledge in a neural network. For example, let $v_A$ and $v_F$ be weights corresponding 
to the SRDIFF values for vegetation types A and F, respectively; to 
ensure these values remained the same, we treated them as a single weight, say 
$v_{AF}$. Here we can see that BPQ calculates the derivative of the error function 
over $v_{AF}$ as a sum of the individual derivatives over $v_A$ and $v_F$, 

$$\frac{\partial \mathcal{F}_2}{\partial v_{AF}} = \frac{\partial \mathcal{F}_2}{\partial v_A} + \frac{\partial \mathcal{F}_2}{\partial v_F} .$$

In the trained neural network, the derivative over $v_{AF}$ becomes zero, but there 
is no guarantee that each derivative over $v_A$ or $v_F$ will do so. Therefore, we can 
treat the sum of the absolute values for derivatives over shared weights, like $v_A$ 
and $v_F$, as a criterion for the ‘unlikeness’ among the elements of such a grouping. 
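
The unlikeness criterion can be illustrated on a toy problem. The sketch below is not the BPQ implementation: it uses three made-up sites, the squared-error form of the intrinsic-value error function, and a closed-form optimum for the shared weight. At that optimum the derivatives for types A and F cancel, yet each is far from zero, which is what the criterion measures.

```python
import math

def type_gradients(sites, v):
    """Derivative of sum (nppc - exp(v_k) * rest)^2 with respect to each v_k."""
    grads = {}
    for veg, nppc, rest in sites:
        pred = math.exp(v[veg]) * rest
        grads[veg] = grads.get(veg, 0.0) - 2.0 * (nppc - pred) * pred
    return grads

def unlikeness(sites, v, group):
    """Sum of absolute per-type derivatives over a group sharing one weight."""
    grads = type_gradients(sites, v)
    return sum(abs(grads[veg]) for veg in group)

# Made-up (vegetation type, NPPc, Rest) triples; not data from the paper.
sites = [("A", 10.0, 20.0), ("A", 12.0, 22.0), ("F", 30.0, 25.0)]

# Shared weight v_AF that minimizes the error: exp(v) = sum(y*r) / sum(r*r).
shared = math.log(sum(y * r for _, y, r in sites) /
                  sum(r * r for _, _, r in sites))
v = {"A": shared, "F": shared}

grads = type_gradients(sites, v)
print(grads["A"] + grads["F"], unlikeness(sites, v, "AF"))
```

A large value signals that the two types would prefer different intrinsic values if the sharing constraint were lifted.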

Table 4 shows the revised values for the intrinsic property SRDIFF that result 
from this approach, along with values for the unlikeness criterion defined above. 

Table 4. Original and revised values, using the original groupings, for the SRDIFF 
intrinsic property, along with the frequency and unlikeness for each vegetation group. 

vegetation type    A ∨ F    B ∨ C    E ∨ J    D ∨ G ∨ H ∨ I ∨ K
original            3.06     4.35     5.09          4.05
revised             2.23     3.27     2.54          1.81
frequency           22.4      9.2     23.4          44.9
unlikeness          26.1      0.3      2.3          13.6



As before, the obtained intrinsic values are always lower than the initial ones, 
and our criterion suggests that the group containing the vegetation types A and 
F has the least coherence. The RMSE for the revised model was 442.782 on the 
training data and 449.097 using leave-one-out cross validation, again indicating 
about four percent reduction in the model’s overall error. 



4.3 Results on Revising Equation Structure 

We also wanted to demonstrate our approach’s ability to improve the functional 
form of the NPPc model. For this purpose, we selected the equation for photo- 
synthetic efficiency, 

$$E = 0.56 \cdot T1 \cdot T2 \cdot W ,$$

which states that this term is a product of the water stress term, W, and the two 
temperature stress terms, T1 and T2. Because each stress factor takes on values 
less than one, multiplication has the effect of reducing photosynthetic efficiency 
E below the maximum 0.56 possible (Potter & Klooster, 1998). 

Since E is calculated as a simple product of the three variables, one natural 
extension was to consider an equation that included exponents on these terms. 
To this end, we borrowed techniques from the RF5 system to create a neural 
network for such an expression, then used BPQ to minimize the error function 

$$\mathcal{F}_3 = \sum_{\text{sample}} \Big( \text{NPPc} - \sum_{\text{month}} u_0 \cdot T1^{u_1} \cdot T2^{u_2} \cdot W^{u_3} \cdot \text{IPAR} \Big)^{2} ,$$

over the parameters $\{u_0, u_1, u_2, u_3\}$, which assumes the equations that predict 
IPAR remain unchanged. We initialized $u_0$ to 0.56 and the other parameters 
to 1.0, as in the original model, and constrained the latter to be positive. The 
revised equation found in this manner, 

$$E = 0.521 \cdot T1^{0.00} \cdot T2^{0.03} \cdot W^{0.00} ,$$

has a small exponent for T2 and zero exponents for T1 and W, suggesting the 
former influences photosynthetic efficiency in minor ways and the latter not at 
all. On the available data, the root mean squared error for the original model 
was 467.910. In contrast, the revised model has an RMSE of 443.307 on the 
training set and an RMSE of 446.270 using cross validation. Thus, the revised 
equation produces a substantially better fit to the observations than does the 
original model, in this case reducing error by almost five percent. 

With regard to Earth science, these results are plausible and the most interesting 
of all, as they suggest that the T1 and W stress terms are unnecessary 
for predicting NPPc. One explanation is that the influence of these factors is al- 
ready being captured by the NDVI measure available from space, for which the 
signal-to-noise ratio has been steadily improving since CASA was first developed. 

These results encouraged us to explore more radical revisions to the func- 
tional form for photosynthetic efficiency. Thus, we told our system to consider a 
form that omitted the three stress factors but that included the four variables - 
Topt, Tempc, EET, and PET - that appear in their definitions: 

$$E = v_0 \cdot \exp(-0.5 \cdot (v_1 \cdot \text{Topt} + v_2 \cdot \text{Tempc} + v_3 \cdot \text{EET} + v_4 \cdot \text{PET} + v_5)^{2}) .$$

This Gaussian-like activation function satisfies the constraint that E is positive 
and less than one. Running BPQ to minimize the error function over $\{v_0, \ldots, v_5\}$ 
produced the equation 

$$E = 0.57 \cdot \exp(-0.5 \cdot (-0.04 \cdot \text{Topt} + 0.03 \cdot \text{Tempc} - 0.03 \cdot \text{EET} + 0.01 \cdot \text{PET})^{2}) ,$$

where we eliminated the parameter $v_5$ because its value was -0.003. The RMSE 
for the revised model was 439.101 on the training data and 444.470 using leave- 
one-out cross validation, indicating more than five percent reduction in error. 

These results are very similar to those from our first approach, which pro- 
duced a cross validation RMSE of 446.270. In this case, the revised model is 
simpler in that it defines E directly in terms of Topt, Tempc, EET, and PET, 
rather than relying on the theoretical terms T1, T2, and W, two of which provide 
no predictive power. On the other hand, the original form for E had a clear 
theoretical interpretation, whereas the new version does not. In such situations, 
the final decision should be left to domain scientists, who are best suited to 
balance a model’s simplicity against its interpretability. 
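
The error reductions quoted in this section follow from simple arithmetic on the reported cross-validation RMSE values. The short sketch below collects them; the dictionary labels are ours, while the numbers are those reported above.

```python
BASELINE_RMSE = 467.910   # original model on the available data

# Cross-validation RMSE for each revision reported in Section 4.
CV_RMSE = {
    "T2 parameters": 461.466,
    "PET parameters": 467.643,
    "SRDIFF, 11 distinct values": 448.376,
    "SRDIFF, original groupings": 449.097,
    "E with exponents": 446.270,
    "E, Gaussian-like form": 444.470,
}

def reduction_percent(rmse, baseline=BASELINE_RMSE):
    """Relative error reduction with respect to the original model."""
    return 100.0 * (baseline - rmse) / baseline

for name, rmse in CV_RMSE.items():
    print(f"{name}: {reduction_percent(rmse):.2f}%")
```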

5 Related Research on Computational Discovery 

Our research on computational scientific discovery draws on two previous lines of 
work. One approach, which has an extended history within artificial intelligence, 
addresses the discovery of explicit quantitative laws. Early systems for numeric 
law discovery like Bacon (Langley, 1979; Langley et al., 1987) carried out a 
heuristic search through a space of new terms and simple equations. Numerous 
successors like Fahrenheit (Zytkow et al., 1990) and RF5 (Saito & Nakano, 
1997) incorporate more sophisticated and more extensive search through a larger 
space of numeric equations. 

The most relevant equation discovery systems take into account domain 
knowledge to constrain the search for numeric laws. For example, Kokar’s (1986) 
Coper utilized knowledge about the dimensions of variables to focus attention 
and, more recently, Washio and Motoda’s (1998) SDS extends this idea to support 
different types of variables and sets of simultaneous equations. Todorovski 
and Dzeroski’s (1997) LaGramge takes a quite different approach, using domain 
knowledge in the form of context-free grammars to constrain its search 
through a space of differential equation models that describe temporal behavior. 

Although research on computational discovery of numeric laws has empha- 
sized communicable scientific notations, it has focused on constructing such laws 
rather than revising existing ones. In contrast, another line of research has ad- 
dressed the refinement of existing models to improve their fit to observations. 
For example, Ourston and Mooney (1990) developed a method that used train- 
ing data to revise models stated as sets of propositional Horn clauses. Towell 
(1991) reports another approach that transforms such models into multilayer 
neural networks, then uses backpropagation to improve their fit to observations, 
much as we have done for numeric equations. Work in this paradigm has em- 
phasized classification rather than regression tasks, but one can view our work 
as adapting the basic approach to equation discovery. 

We should also mention related work on the automated improvement of 
ecosystem models. Most AI work on Earth science domains focuses on learn- 
ing classifiers that predict vegetation from satellite measures like NDVI, as con- 
trasted with our concern for numeric prediction. Chown and Dietterich (2000) 
describe an approach that improves an existing ecosystem model’s fit to contin- 
uous data, but their method only alters parameter values and does not revise 
equation structure. On another front, Schwabacher and Langley (2001) use a 
rule-induction algorithm to discover piecewise linear models that predict NDVI 
from climate variables, but their method takes no advantage of existing models. 

6 Directions for Future Research 

Although we have been encouraged by our results to date, there remain a number 
of directions in which we must extend our approach before it can become a useful 
tool for scientists. As noted earlier, we envision an interactive discovery aid 
that lets the user focus the system’s attention on those portions of the model 
it should attempt to improve. To this end, we need a graphical interface that 
supports marking of parameters, intrinsic properties, and equations that can be 
revised, as well as tools for displaying errors as a function of space, time, and 
predictive variables. 

In addition, the current system is limited to revising the parameters or form 
of one equation in the model at a time, as well as requiring some handcrafting 
to encode the equations as a neural network. Future versions should support 
revisions of multiple equations at the same time, preferably invoking the same 
variants of backpropagation as we have used to date, and also provide a li- 
brary that maps functional forms to neural network encodings, so the system 
can transform the former into the latter automatically. We should also explore 
using other approaches to equation discovery, such as Todorovski and Dzeroski’s 
LaGramge, in place of the RF6 algorithm. 

Naturally, we also hope to evaluate our approach on its ability to improve 
other portions of the CASA model, as additional data becomes available. Another 
test of generality would be application of the same methods to other scientific 
domains in which there already exist formal models that can be revised. 
In the longer term, we should evaluate our interactive system not only in its 
ability to increase the predictive accuracy of an existing model, but in terms of 
the satisfaction to scientists who use the system to that end. 

Another challenge that we have encountered in our research has been the need 
to translate the existing CASA model into a declarative form that our discovery 
system can manipulate. In response, another long-term goal involves developing 
a modeling language in which scientists can cast their initial models and carry 
out simulations, but that can also serve as the declarative representation for 
our discovery methods. The ability to automatically revise models places novel 
constraints on such a language, but we are confident that the result will prove a 
useful aid to the discovery process. 



7 Concluding Remarks 

In this paper, we addressed the computational task of improving an existing sci- 
entific model that is composed of numeric equations. We illustrated this problem 
with an example model from the Earth sciences that predicts carbon production 
as a function of temperature, sunlight, and other variables. We identified three 
activities that can improve a model - revising an equation’s parameters, alter- 
ing the values of an intrinsic property, and changing the functional form of an 
equation, then presented results for each type on an ecosystem modeling task 
that reduced the model’s prediction error, sometimes substantially. 

Our research on model revision builds on previous work in numeric law dis- 
covery and qualitative theory refinement, but it combines these two themes in 
novel ways to enable new capabilities. Clearly, we remain some distance from 
our goal of an interactive discovery tool that scientists can use to improve their 
models, but we have also taken some important steps along the path, and we 
are encouraged by our initial results on an important scientific problem. 



Abstract. An EFS is a kind of logic program expressing various formal 
languages. We propose an efficient derivation for EFS’s called an 
S-derivation, where all possible unifiers are evaluated at one step of 
the derivation. In the S-derivation, each unifier is partially applied to 
each goal clause by assigning variables whose values are uniquely determined 
from the set of all possible unifiers. This helps reduce the amount 
of backtracking, and thus the S-derivation works efficiently. In 
this paper, the S-derivation is shown to be complete for the class of regular 
EFS’s. We implement an EFS interpreter based on the S-derivation in the 
Prolog programming language, and compare its parsing time with that 
of the DCG provided by the Prolog interpreter. The experimental results 
confirm the efficiency of the S-derivation for accepting context-free 
languages. 



1 Introduction 

In the area of machine learning or discovery science, it is an important issue 
to develop efficient systems dealing with formal languages under a theoretical 
background. An elementary formal system (EFS, for short) is a kind of logic 
program over the domain of strings [3,11,15]. EFS’s are well known to be 
flexible enough to represent not only classes of languages in Chomsky hierar- 
chy [3] but also binary relations over strings [12,13]. It has been shown that the 
EFS is suitable to discuss learnability in the framework for inductive inference 
and machine learning of languages [2,3,9,10]. Mukouchi and Arikawa [8] devel- 
oped a theoretical framework for machine discovery, where refutability of search 
space is shown to be the most important factor and one of such refutably learn- 
able classes is the class of length-bounded EFS’s. Theoretically, EFS’s can be 
used as working systems as Prolog programs because a derivation based on the 
resolution principle [7] is also defined for EFS’s. In EFS’s, a derivation proce- 
dure is formalized as an acceptor for formal languages [3,15]. Furthermore, the 
derivation can be used to generate languages [14]. The purpose of this research 
is to develop an efficient derivation and construct an EFS interpreter based on 
the derivation. 

Since an EFS deals with strings as its domain, unifications for strings should
be computed efficiently at each step of the derivation. However, it is known
that the unification problem for strings is computationally hard, and the
unifier is not always uniquely determined even if it is restricted to the
maximally general unifier [5,6]. On the other hand, for the first-order terms
used in the Prolog programming language, the unifier is uniquely determined as
the most general unifier. Therefore, in an EFS, backtracking occurs for each
selection of unifiers as well as clauses. Harada et al. [4] introduced
restricted EFS's called variable-separated EFS's, where no two variables occur
successively in any term. In a variable-separated EFS, the number of possible
unifiers is decreased, and the derivation works efficiently. However, a
variable-separated EFS may be much larger than an equivalent
non-variable-separated EFS. This causes inefficiency in parsing languages.
Here, we introduce another approach to developing an efficient EFS interpreter.

When strings have successive occurrences of variables, the number of unifiers
becomes large, as pointed out by Harada et al. [4]. For example, the string
xyz of variables and the string a_1 a_2 ... a_n of constant symbols have
O(n^2) unifiers, because, for each i (i = 1, 2, ..., n - 2) and
j (j = i + 1, i + 2, ..., n - 1), the substitution replacing x with
a_1 a_2 ... a_i, y with a_{i+1} a_{i+2} ... a_j, and z with
a_{j+1} a_{j+2} ... a_n is a unifier of them. In EFS's, since there are many
selections for unifiers at each step of a derivation, it has been difficult to
construct an efficient interpreter. Thus, we propose a new approach that
evaluates all possible unifiers at one step of the derivation. We formalize a
derivation with sets of unifiers (an S-derivation, for short). In the
S-derivation, each unifier is partially applied to each goal clause by
assigning variables whose values are uniquely determined from the set of all
possible unifiers. The S-derivation is a natural extension of the standard
derivation for EFS's, because the set of unifiers can be regarded as the unique
unifier in EFS's corresponding to the most general unifier in the first-order
language. We show that the S-derivation is complete for restricted EFS's
called regular EFS's, which define exactly the class of context-free languages.
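
The quadratic growth of the number of unifiers can be checked directly. The
following Python sketch is ours, not part of the paper's Prolog system; the
function name unifiers_xyz is hypothetical. It enumerates every way of
splitting a ground string into three nonempty parts, which is the set of
unifiers of xyz with that string, so a string of length n yields
(n - 1)(n - 2)/2 unifiers.

```python
def unifiers_xyz(w):
    """Return all unifiers of the term xyz with the ground string w.

    Each variable must receive a nonempty string, so a unifier is a
    choice of two cut positions 1 <= i < j <= len(w) - 1.
    """
    n = len(w)
    return [
        {"x": w[:i], "y": w[i:j], "z": w[j:]}
        for i in range(1, n - 1)
        for j in range(i + 1, n)
    ]


if __name__ == "__main__":
    for n in (5, 10, 20):
        count = len(unifiers_xyz("a" * n))
        # The count matches the closed form (n - 1)(n - 2) / 2.
        print(n, count, (n - 1) * (n - 2) // 2)
```

A naive derivation may have to try each of these unifiers in turn, which is
the source of the backtracking that the S-derivation is designed to avoid.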

We implement an S-derivation for regular EFS's in the Prolog programming
language, and verify the efficiency of the S-derivation by comparing its
running time with that of definite clause grammars (DCG's) provided by the
Prolog interpreter. In our EFS interpreter, each unifier is efficiently
computed by using the Aho-Corasick pattern matching algorithm [1]. The
Aho-Corasick algorithm finds all occurrences of patterns in the text in time
linear in the length of the text. A regular EFS is well suited to this
computation of unifiers, because each string in the derivation is a substring
of the initially given text. Therefore, all unifiers used in a derivation can
be computed by scanning the given text only once. The experimental results
show that the S-derivation using the Aho-Corasick algorithm is efficient with
respect to the length of a given text and the number of variables in the EFS.

This paper is organized as follows: In Section 2, we give some notations
and definitions including derivation and semantics for EFS's. In Section 3,
we introduce the S-derivation and prove its completeness. In Section 4, we
outline the EFS interpreter based on the S-derivation, and show experimental
results for typical examples of EFS's where the S-derivation works
efficiently. Finally, in Section 5, we summarize the results of this research
and describe some open problems.

2 Preliminaries 

In this section, we give some basic definitions and notations according to [3,14, 
15]. 

2.1 Elementary Formal Systems 

For a given set A, the set of all finite strings of symbols from A is denoted by
A*. The empty string is denoted by ε. A+ denotes the set A* - {ε}.

Let Σ, X, and Π be mutually distinct sets. We assume that Σ is a finite
set of constant symbols, X is a set of variables, and Π is a finite set of
predicate symbols. Each predicate symbol is associated with a non-negative
integer called its arity.

A term is an element of (Σ ∪ X)+. A term is said to be regular if every
variable occurs at most once in the term. An atomic formula (atom, for short)
is of the form p(π_1, π_2, ..., π_n), where p is a predicate symbol with arity
n and each π_i is a term (i = 1, 2, ..., n). A definite clause (clause, for
short) is of the form A ← B_1, ..., B_n (n ≥ 0), where A, B_1, ..., B_n are
atoms. The atom A and the sequence B_1, ..., B_n are called the head and the
body of the clause, respectively. A goal clause (goal, for short) is of the
form ← B_1, ..., B_n (n ≥ 0), and the goal with n = 0 is called the empty
goal. An expression is a term, an atom, a clause, or a goal. An expression E
is said to be ground if E has no variable. For an expression E and a variable
x, var(E) and oc(x, E) denote the set of all variables occurring in E and the
number of occurrences of x in E, respectively. An elementary formal system
(EFS, for short) is a finite set of clauses.

A substitution θ is a (semi-group) homomorphism from (Σ ∪ X)+ to itself
satisfying the following conditions:

1. aθ = a for each a ∈ Σ, and

2. the set {x ∈ X | xθ ≠ x}, denoted by D(θ), is finite.

For a substitution θ, if D(θ) = {x_1, x_2, ..., x_n} and x_i θ = π_i for every
i (i = 1, 2, ..., n), then θ is denoted by the set
{x_1/π_1, x_2/π_2, ..., x_n/π_n}. For an expression E and a substitution θ,
Eθ is defined as the expression obtained by simultaneously replacing each
variable x in E with xθ.

Let (E_1, E_2) be a pair of expressions. Then a substitution θ is said to be
a unifier of E_1 and E_2 if E_1 θ = E_2 θ. The set of all unifiers θ of E_1
and E_2 satisfying D(θ) ⊆ var(E_1) ∪ var(E_2) is denoted by U(E_1, E_2). We
say that E_1 and E_2 are unifiable if the set U(E_1, E_2) is not empty. An
expression E_1 is a variant of E_2 if there exist two substitutions θ and δ
such that E_1 θ = E_2 and E_2 δ = E_1.

2.2 The Semantics of EFS’s 

We give two semantics of EFS's by using provability relations and derivations.
First, we introduce the provability semantics. Let Γ and C be an EFS and a
clause. Then, the provability relation Γ ⊢ C is inductively defined as follows:

1. If C ∈ Γ then Γ ⊢ C.

2. If Γ ⊢ C then Γ ⊢ Cθ for any substitution θ.

3. If Γ ⊢ A ← B_1, ..., B_m and Γ ⊢ B_m ← then Γ ⊢ A ← B_1, ..., B_{m-1}.

A clause C is provable from Γ if Γ ⊢ C holds. The provability semantics of the
EFS Γ, denoted by PS(Γ), is defined as the set of all ground atoms A satisfying
Γ ⊢ A ←. For an EFS Γ and a unary predicate symbol p, the language
defined by Γ and p is denoted by L(Γ, p), and defined as the set of all strings
w ∈ Σ+ such that p(w) ∈ PS(Γ).

The second semantics is based on a derivation for EFS's. We assume a
computation rule R to select an atom from every goal. Let Γ be an EFS, G be
a goal, and R be a computation rule. A derivation from G is a (finite or
infinite) sequence of triplets (G_i, C_i, θ_i) (i = 0, 1, ...) which satisfies
the following conditions:

1. G_i is a goal, θ_i is a substitution, C_i is a variant of a clause in Γ,
and G_0 = G.

2. var(C_i) ∩ var(C_j) = ∅ for every i and j (i ≠ j), and
var(C_i) ∩ var(G_i) = ∅ for every i.

3. If G_i = ← A_1, ..., A_k, and A_m is the atom selected by R, then C_i is of
the form A ← B_1, ..., B_n satisfying that A and A_m are unifiable,
θ_i ∈ U(A, A_m), and G_{i+1} is of the following form:

(← A_1, ..., A_{m-1}, B_1, ..., B_n, A_{m+1}, ..., A_k)θ_i.

The atom A_m is called a selected atom of G_i, and G_{i+1} is called a
resolvent of G_i and C_i by θ_i.

A refutation is a finite derivation ending with the empty goal. The procedural
semantics of an EFS Γ, denoted by RS(Γ), is defined as the set of all ground
atoms A satisfying that there exists a refutation of Γ from the goal ← A.

It has been shown that PS(Γ) = RS(Γ) for every EFS Γ [15]. This implies
that a string w ∈ Σ+ is in the language defined by an EFS Γ and a predicate
symbol p if and only if there exists a refutation of Γ from ← p(w). Thus, the
derivation procedure can be regarded as an acceptor for the language.
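
To make the acceptor view concrete, here is a minimal Python sketch of a naive
derivation-based acceptor for regular EFS's (defined in Section 3.3). It is our
illustration, not the paper's Prolog interpreter. The clause encoding, the
names VARIABLES, GAMMA1, matches, and accepts, and the single-character
variable names x and y are all assumptions; GAMMA1 is modelled on the first EFS
used in the experiments of Section 4.3. The sketch assumes that every body
variable is bound to a string shorter than the goal string, which holds for
this EFS and guarantees termination.

```python
VARIABLES = {"x", "y"}  # single-character variable names (our convention)

# Each clause is (predicate, head term, body); the body lists
# (predicate, variable) pairs, as in a regular EFS.
GAMMA1 = [
    ("p0", "xy", [("p1", "x"), ("p2", "y")]),
    ("p1", "aaxaa", [("p1", "x")]),
    ("p1", "bbxbb", [("p1", "x")]),
    ("p1", "aaaa", []),
    ("p1", "bbbb", []),
    ("p2", "a", []),
    ("p2", "aa", []),
]


def matches(term, word):
    """Yield every unifier of a regular term with a ground word."""
    if not term:
        if not word:
            yield {}
        return
    head, rest = term[0], term[1:]
    if head in VARIABLES:
        # A variable may take any nonempty prefix of the remaining word.
        for k in range(1, len(word) + 1):
            for binding in matches(rest, word[k:]):
                yield {head: word[:k], **binding}
    elif word.startswith(head):
        yield from matches(rest, word[1:])


def accepts(efs, pred, word):
    """Return True if a refutation from the goal <- pred(word) exists."""
    for head_pred, head_term, body in efs:
        if head_pred != pred:
            continue
        # Backtrack over every unifier of the clause head with the goal.
        for binding in matches(head_term, word):
            if all(accepts(efs, q, binding[v]) for q, v in body):
                return True
    return False


if __name__ == "__main__":
    for w in ("aaaaa", "aabbbbaaaa", "ab"):
        print(w, accepts(GAMMA1, "p0", w))
```

The inner loop over matches is where a standard derivation backtracks over
individual unifiers; the S-derivation of Section 3 replaces this loop by a
single step on the whole set of unifiers.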

Finally, we define the set of finitely failed atoms of an EFS. Let Γ be an
EFS, and (G_i, C_i, θ_i) (i = 0, 1, ..., n) be a finite derivation of Γ. The
derivation is said to be finitely failed with the length n if there exists no
clause in Γ such that its head and the selected atom of G_n are unifiable.
Furthermore, we define FFS(Γ) as the set of all ground atoms A satisfying that
all derivations of Γ from ← A are finitely failed within the length n.




354 N. Sugimoto, H. Ishizaka, and T. Shinohara 



3 Extended Derivations with Sets of Unifiers 

In this section, we introduce a derivation with sets of unifiers (S-derivation,
for short). In the S-derivation, each unifier is partially applied to each goal clause
by assigning variables whose values are uniquely determined from the set of 
all possible unifiers. Since there are infinitely many unifiers for terms contain- 
ing variables, it is difficult to compute the derivation from the goal containing 
variables. However, for restricted terms, all unifiers are computable by using 
maximally general unifiers. The S-derivation works efficiently by using the max- 
imally general unifiers. Furthermore, in this section, the S-derivation is shown to 
be complete for accepting and generating languages defined by restricted EFS’s 
called regular EFS’s. 

3.1 Maximally General Unifiers 

Let θ = {x_1/π_1, x_2/π_2, ..., x_m/π_m} and δ = {y_1/τ_1, y_2/τ_2, ..., y_n/τ_n}
be substitutions. Then, we define a composition of θ and δ as follows:

θ · δ = {x_i/π_i δ | x_i ≠ π_i δ} ∪ {y_i/τ_i | y_i ∉ D(θ)}.

Let θ, δ and γ be substitutions, and E be an expression. Then, we can prove the
following equations along the same line of argument as for definite programs [7]:

1. (Eθ)δ = E(θ · δ), and

2. (θ · δ) · γ = θ · (δ · γ).
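
The composition and the first equation can be illustrated with a short Python
sketch. It is ours and only illustrative: terms are encoded as tuples of
symbols, a substitution as a dict from variable names to terms, and the names
apply_subst and compose are hypothetical. The values used are those of
Example 2 below.

```python
def apply_subst(term, subst):
    """Apply a substitution to a term given as a tuple of symbols."""
    out = []
    for sym in term:
        out.extend(subst.get(sym, (sym,)))
    return tuple(out)


def compose(theta, delta):
    """Return theta . delta following the definition above."""
    result = {}
    for x, pi in theta.items():
        image = apply_subst(pi, delta)
        if image != (x,):  # drop trivial bindings x/x
            result[x] = image
    for y, tau in delta.items():
        if y not in theta:  # keep only bindings of delta outside D(theta)
            result[y] = tau
    return result


if __name__ == "__main__":
    theta = {"x": ("a", "y", "a"), "z": ("y",)}
    delta = {"y": ("y1", "y2")}
    expr = ("x", "b", "z")
    print(compose(theta, delta))
    # Equation 1: (E theta) delta == E (theta . delta)
    print(apply_subst(apply_subst(expr, theta), delta)
          == apply_subst(expr, compose(theta, delta)))
```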

Let V be a finite set of variables, and (θ, δ) be a pair of substitutions. Then,
we say that θ and δ are equivalent on V if πθ is a variant of πδ for any
π ∈ (Σ ∪ V)+. We show that the problem of determining whether or not θ and δ
are equivalent on V is solvable by the following lemma.

Lemma 1. Let θ and δ be substitutions, and V = {x_1, x_2, ..., x_n} be a finite
set of variables. Then, θ and δ are equivalent on V if and only if the following
statements hold:

1. xθ is a variant of xδ, for every x ∈ V, and

2. x_1 x_2 ... x_n θ is a variant of x_1 x_2 ... x_n δ.

Proof. We can prove this lemma by induction on the length of π ∈ (Σ ∪ V)+. □

Let (E_1, E_2) be a pair of expressions. A maximally general unifier (mxgu,
for short) of E_1 and E_2 is a unifier θ ∈ U(E_1, E_2) satisfying that, for any
δ ∈ U(E_1, E_2) such that θ and δ are not equivalent on var(E_1) ∪ var(E_2),
there is no substitution γ such that θ = δ · γ. The set of all mxgu's of E_1
and E_2 is denoted by MXGU(E_1, E_2).

For two terms π and τ, we define the number of mxgu's of π and τ as
the number of equivalence classes of substitutions on var(π) ∪ var(τ). Thus,
we say that MXGU(π, τ) is finite if the number of mxgu's is finite when
equivalent substitutions on var(π) ∪ var(τ) are identified. From the
definition of maximally general unifiers, the following lemmas hold [5,6,14].

Lemma 2. Let π and τ be regular terms such that var(π) ∩ var(τ) = ∅. Then,
the set MXGU(π, τ) is finite and computable.

Lemma 3. Let π and τ be terms. If π is ground, then the set MXGU(π, τ) is
finite and computable, and MXGU(π, τ) = U(π, τ) holds.

Lemma 4. Let x be a variable and π be a term which does not include x. Then,
MXGU(π, x) is a singleton set which consists of the substitution {x/π}.

3.2 S-Derivation 

In the following argument, we assume that every substitution θ satisfies
θ · θ = θ, that is, var(xθ) ∩ D(θ) = ∅ for every variable x ∈ D(θ).

Definition 1. For two substitutions θ and δ, we define θ ∘ δ as the set of all
substitutions σ satisfying σ = θ · δ · γ = δ · θ · γ for some substitution γ.

Note that, for each element σ of the set θ ∘ δ, xσ is an element of the
intersection of the sets of strings which are unifiable with xθ and xδ.

Substitutions θ and δ are said to be inconsistent if θ ∘ δ = ∅, and consistent
otherwise. We define MIN(θ ∘ δ) as the minimum subset of θ ∘ δ satisfying that,
for any σ ∈ θ ∘ δ, there exists σ' ∈ MIN(θ ∘ δ) such that σ = σ' · γ for some
substitution γ.

For two finite sets Θ and Δ of substitutions, we define

1. MIN(Θ ∘ Δ) = ∪_{(θ,δ) ∈ Θ×Δ} MIN(θ ∘ δ), and

2. INT(Θ) = ∩_{θ ∈ Θ} θ.

Lemma 5. Let θ and δ be substitutions. If δ is ground, then the set MIN(θ ∘ δ)
is finite and computable.

Proof. Let θ and δ be substitutions {x_i/π_i | i ∈ {1, 2, ..., m}} and
{y_i/τ_i | i ∈ {1, 2, ..., n}}, respectively.

If σ ∈ θ ∘ δ then there exists a substitution γ satisfying that

1. σ = θ · δ · γ = {x_i/π_i δ γ | i = 1, 2, ..., m} ∪ δ ∪ γ, and

2. π_i δ γ = τ_j for every x_i = y_j ∈ D(θ) ∩ D(δ),

from Definition 1. Let S be the set of all possible γ satisfying the above
conditions and D(γ) ⊆ var(π_1 δ) ∪ var(π_2 δ) ∪ ... ∪ var(π_m δ). Since, from
Lemma 3, the set U(π_i δ, τ_j) is finite for each x_i = y_j ∈ D(θ) ∩ D(δ), the
set S is also finite and computable. It is clear that
σ = θ · δ · γ ∈ MIN(θ ∘ δ) for each γ ∈ S, because every γ ∈ S is ground.
Furthermore, we can show that, for every substitution γ', if
θ · δ · γ' = δ · θ · γ' holds, then there exists γ ∈ S such that γ' = γ · γ''.
Thus, the set MIN(θ ∘ δ) consists of θ · δ · γ for every γ ∈ S. It is clear
that MIN(θ ∘ δ) is finite and computable. □

Example 1. We consider the substitutions θ_1 = {x/y}, θ_2 = {x/aaz, y/az},
θ_3 = {x/aaz, y/zb}, and δ = {x/aaa, y/ab}.

From xθ_1 δ = yδ = ab and xδ = aaa, the set U(xθ_1 δ, xδ) is empty. Thus,
θ_1 ∘ δ = ∅, and θ_1 and δ are inconsistent.

From xθ_2 δ = aazδ = aaz and xδ = aaa, U(xθ_2 δ, xδ) is the set {{z/a}}. From
yθ_2 δ = azδ = az and yδ = ab, U(yθ_2 δ, yδ) is the set {{z/b}}. Then, there
exists no substitution γ such that xθ_2 δγ = aaa and yθ_2 δγ = ab. Therefore,
θ_2 ∘ δ = ∅, and θ_2 and δ are inconsistent.

From xθ_3 δ = aazδ = aaz and xδ = aaa, U(xθ_3 δ, xδ) is the set {{z/a}}.
From yθ_3 δ = zbδ = zb and yδ = ab, U(yθ_3 δ, yδ) is the set {{z/a}}. Then, only
the substitution γ = {z/a} satisfies xθ_3 δγ = aaa and yθ_3 δγ = ab. Therefore,
MIN(θ_3 ∘ δ) has only one element θ_3 · δ · γ = {x/aaa, y/ab, z/a}, and θ_3 and
δ are consistent.
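
The three cases of Example 1 can be replayed mechanically. The Python sketch
below is ours and deliberately partial: for a ground δ it searches only for
the substitutions γ that satisfy condition 2 in the proof of Lemma 5, that is,
π_i δ γ = τ_j for every shared variable, and it does not construct θ · δ · γ.
The names apply_subst, matches, and consistent_extensions are hypothetical,
and variables are assumed to be the single characters x, y, z.

```python
VARIABLES = {"x", "y", "z"}  # single-character variable names (assumption)


def apply_subst(term, subst):
    """Apply a substitution to a term written as a string of symbols."""
    return "".join(subst.get(sym, sym) for sym in term)


def matches(term, word, binding):
    """Yield all extensions of binding that make term equal to word."""
    if not term:
        if not word:
            yield dict(binding)
        return
    head, rest = term[0], term[1:]
    if head not in VARIABLES:
        if word.startswith(head):
            yield from matches(rest, word[1:], binding)
    elif head in binding:
        value = binding[head]  # variable already fixed by an earlier equation
        if word.startswith(value):
            yield from matches(rest, word[len(value):], binding)
    else:
        for k in range(1, len(word) + 1):
            binding[head] = word[:k]
            yield from matches(rest, word[k:], binding)
            del binding[head]


def consistent_extensions(theta, delta):
    """Return every gamma with (x theta delta) gamma = x delta on shared x."""
    gammas = [{}]
    for x, pi in theta.items():
        if x in delta:
            term = apply_subst(pi, delta)
            gammas = [g2 for g in gammas
                      for g2 in matches(term, delta[x], dict(g))]
    return gammas


if __name__ == "__main__":
    delta = {"x": "aaa", "y": "ab"}
    for theta in ({"x": "y"},
                  {"x": "aaz", "y": "az"},
                  {"x": "aaz", "y": "zb"}):
        print(theta, consistent_extensions(theta, delta))
```

An empty result corresponds to an inconsistent pair; for θ_3 the single
result {z/a} is the γ found in the example.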

Lemma 6. Let θ = {x_i/π_i | i = 1, 2, ..., m} and δ = {y/τ} be substitutions
satisfying that, for each i (i = 1, 2, ..., m), π_i and τ are regular, and
var(π_i) ∩ var(τ) = D(θ) ∩ var(τ) = ∅. Then, the set MIN(θ ∘ δ) is finite and
computable.

Proof. If y ∉ D(θ) then δ · θ · δ = θ · δ and θ · δ · δ = θ · δ from the
assumption and δ · δ = δ. Thus, θ · δ ∈ θ ∘ δ holds. Furthermore, for every
σ ∈ θ ∘ δ, σ = θ · δ · γ holds for some substitution γ. Thus, the set
MIN(θ ∘ δ) is a singleton set {θ · δ}.

If y = x_k ∈ D(θ) then y ∉ var(π_i) for every i (i = 1, 2, ..., m) from the
assumption θ · θ = θ. Thus, θ · δ = θ holds. Furthermore, from the assumption
of this lemma, var(τ) ∩ D(θ) = ∅ and δ · θ = δ ∪ {x_i/π_i | i ≠ k}. If π_k and
τ are not unifiable, then θ and δ are inconsistent. Otherwise, for any
γ ∈ MXGU(π_k, τ), θ · γ ∈ θ ∘ δ, because θ · δ · γ = δ · θ · γ = θ · γ holds.
Furthermore, from the definition of mxgu's, for any substitution σ such that
θ · δ · σ ∈ θ ∘ δ, there exists γ ∈ MXGU(π_k, τ) satisfying that σ = γ · γ'
for some substitution γ'. Thus, MIN(θ ∘ δ) is the set
{θ · γ | γ ∈ MXGU(π_k, τ)}. Since the set MXGU(π_k, τ) is finite and
computable from Lemma 2, MIN(θ ∘ δ) is also finite and computable. □

Example 2. Let θ_1 = {x/aya, z/y}, θ_2 = {y/aza}, and δ = {y/y_1 y_2}. Then,
MIN(θ_1 ∘ δ) is a singleton set which consists of
θ_1 · δ = {x/a y_1 y_2 a, y/y_1 y_2, z/y_1 y_2}.
On the other hand, since

MXGU(aza, y_1 y_2) = { {y_1/a, y_2/za},
                       {y_1/az, y_2/a},
                       {y_1/a z_1, y_2/z_2 a, z/z_1 z_2} },

we can obtain the following set:

MIN(θ_2 ∘ δ) = { {y/aza, y_1/a, y_2/za},
                 {y/aza, y_1/az, y_2/a},
                 {y/a z_1 z_2 a, y_1/a z_1, y_2/z_2 a, z/z_1 z_2} }.





Definition 2. Let Γ be an EFS, G be a goal of Γ, and R be a computation rule.
An S-derivation from G is a (finite or infinite) sequence of triplets
(G_i, C_i, Θ_i) (i = 0, 1, ...) which satisfies the following conditions:

1. G_i is a goal, Θ_i is a finite set of substitutions, C_i is a variant of a
clause in Γ, and G_0 = G.

2. var(C_i) ∩ var(C_j) = ∅ for every i and j (i ≠ j), and
var(C_i) ∩ var(G_i) = ∅ for every i.

3. Let G_i = ← A_1, ..., A_k, C_i = A ← B_1, ..., B_q, and A_m be the selected
atom of G_i. If i = 0, then Θ_i = MXGU(A_m, A). Otherwise,
Θ_i = MIN(Θ_{i-1} ∘ MXGU(A_m, A)) for each i. The next goal G_{i+1} is of the
following form:

(← A_1, ..., A_{m-1}, B_1, ..., B_q, A_{m+1}, ..., A_k)INT(Θ_i).

If the S-derivation ends with the empty goal G_n, then it is said to be an
S-refutation from G, and each substitution in Θ_{n-1} is called an answer
substitution for G by Γ.



Definition 3. Let Γ be an EFS, and (G_i, C_i, Θ_i) (i = 0, 1, ..., n) be a
finite S-derivation of Γ. The derivation is said to be finitely failed with the
length n if

1. Θ_n = ∅, or

2. there exists no clause in Γ such that its head and the selected atom of G_n
are unifiable.

For an EFS Γ, we define the following two sets: SFFS(Γ) is the set of all ground
atoms A satisfying that all S-derivations of Γ from ← A are finitely failed
within the length n, and SRS(Γ) is the set of all ground atoms A satisfying
that there exists an S-refutation of Γ from ← A.



3.3 Completeness of S-Derivation 

An EFS Γ is said to be regular if all predicate symbols in Γ are unary, and each
clause A ← B_1, B_2, ..., B_n in Γ satisfies the following conditions:

1. the term in A is regular,

2. the terms in B_1, B_2, ..., B_n are mutually distinct variables, and

3. var(B_1) ∪ var(B_2) ∪ ... ∪ var(B_n) ⊆ var(A).

It has been shown that the class of languages defined by regular EFS's is
equivalent to that of context-free languages [3]. For regular EFS's, we show
that an S-derivation is complete by the following theorem.

Theorem 1. For every regular EFS Γ, PS(Γ) = RS(Γ) = SRS(Γ) holds.

The above theorem can be proved by the following lemmas and proposition. 

Lemma 7. Let Γ be a regular EFS, G_0 be a ground goal, and (G_i, C_i, Θ_i)
(i = 0, 1, ..., n) be an S-derivation from G_0. Then, for every σ ∈ Θ_{n-1}, σ
is ground, and there exists a derivation (G'_i, C_i, θ_i) (i = 0, 1, ..., n)
such that G'_0 = G_0, and G'_i = G_i σ for each i (i = 1, 2, ..., n).

Proof. Let p(w) be a selected atom of G_0, and p(π) be the head of C_0. Then,
from the definition of an S-derivation, Θ_0 = U(w, π) holds. Furthermore, for
every σ ∈ Θ_0, INT(Θ_0) ⊆ σ holds. Let G'_1 be the resolvent of G_0 and C_0 by
σ. Then, the derivation (G_0, C_0, σ), (G'_1, _, _) satisfies the statement.

Next, we assume that p(τ) is a selected atom of G_{n-1}, and p(π) is the
head of C_{n-1}. Then, from the definitions of an S-derivation and a regular
EFS, τ is a ground term w ∈ Σ+ or a variable x ∈ D(σ_{n-2}) for every
σ_{n-2} ∈ Θ_{n-2}. If σ ∈ Θ_{n-1} then there exists σ_{n-2} ∈ Θ_{n-2} such
that σ ∈ σ_{n-2} ∘ δ for some δ ∈ MXGU(τ, π).

If the selected atom is p(w) then δ is ground. Thus, σ is also ground. If the
selected atom is p(x) then δ = {x/π} from Lemma 4. Since x/w ∈ σ_{n-2} for some
w ∈ Σ+, σ ∈ σ_{n-2} ∘ δ = {σ_{n-2} · γ | γ ∈ MXGU(w, π)}. Thus, σ is ground.

From the assumption of the induction, there exists a derivation
(G'_i, C_i, θ_i) (i = 0, 1, ..., n - 1) such that G'_0 = G_0, and
G'_i = G_i σ_{n-2} for each i (i = 1, 2, ..., n - 1). Since σ_{n-2} is ground,
it is clear that σ_{n-2} ⊆ σ. Let G'_n be the resolvent of G'_{n-1} and C_{n-1}
by θ_{n-1} ∈ U(w, π); then it is clear that the derivation (G'_i, C_i, θ_i)
(i = 0, 1, ..., n) satisfies the statement. □

Lemma 8. Let Γ be a regular EFS, G_0 be a ground goal, and (G_i, C_i, θ_i)
(i = 0, 1, ...) be a derivation from G_0. Then, there exists an S-derivation
(G'_i, C_i, Θ_i) (i = 0, 1, ...) and a substitution σ_i ∈ Θ_i such that
G'_0 = G_0, and G'_{i+1} σ_i = G_{i+1} for each i (i = 1, 2, ...).

Proof. Let p(w) be a selected atom of G_0, and p(π) be the head of C_0. Then,
from the definition of a derivation, θ_0 ∈ U(w, π). On the other hand, from the
definition of an S-derivation, Θ_0 = MXGU(w, π) = U(w, π). Thus, θ_0 ∈ Θ_0,
and G'_1 θ_0 = G_1.

Next, we assume that there exists σ_{k-1} ∈ Θ_{k-1} such that
G'_k σ_{k-1} = G_k. Let p(w) be a selected atom of G_k, and the head of C_k be
p(π). Then, from the definition of a derivation, θ_k ∈ U(w, π). On the other
hand, from the definition of an S-derivation and the assumption of the
induction, the selected atom of G'_k has the form p(w) or p(x) for some
w ∈ Σ+ and x ∈ D(σ_{k-1}).

If the selected atom is p(w) then Θ_k = MIN(Θ_{k-1} ∘ U(w, π)). Since
σ_{k-1} ∈ Θ_{k-1} and θ_k ∈ U(w, π), σ_{k-1} ∘ θ_k ⊆ Θ_k holds. Furthermore,
from the definitions of an S-derivation, D(σ_{k-1}) ∩ D(θ_k) = ∅. Thus,
σ_{k-1} and θ_k are consistent, and σ_{k-1} ∪ θ_k ∈ σ_{k-1} ∘ θ_k.

If the selected atom is p(x) then Θ_k = MIN(Θ_{k-1} ∘ MXGU(x, π)), and
MXGU(x, π) is a singleton set {{x/π}} from Lemma 4. Since σ_{k-1} ∈ Θ_{k-1},
MIN(σ_{k-1} ∘ {x/π}) ⊆ Θ_k holds. Furthermore, from x/w ∈ σ_{k-1} and
U(w, π) ≠ ∅, {x/π} and σ_{k-1} are consistent. From the statement in the proof
of Lemma 6, MIN({x/π} ∘ σ_{k-1}) = {σ_{k-1} · γ | γ ∈ MXGU(w, π)}. Since
θ_k ∈ MXGU(w, π), σ_{k-1} · θ_k = σ_{k-1} ∪ θ_k ∈ Θ_k.

It is clear that σ_{k-1} ∪ θ_k becomes a substitution, and satisfies the
statement. □

From Lemmas 7 and 8, we can prove the following proposition.

Proposition 1. Let Γ be a regular EFS, and G_0 be a ground goal. Then, there
exists a refutation from G_0 if and only if there exists an S-refutation from
G_0.

Furthermore, we can also prove the following theorem.

Theorem 2. For every regular EFS Γ, FFS(Γ) = SFFS(Γ) holds.

Example 3. For an EFS

Γ = { (1) p(xy) ← q(x), r(y);
      (2) q(a^n) ←;
      (3) r(aa) ← }

and a goal, Fig. 1 describes the derivation and the S-derivation as trees like
SLD-trees [7]. In the derivation and the S-derivation in Fig. 1, the label
(k, θ) on each edge represents the derivation step by the clause (k) and the
unifier or the set of unifiers θ. The derivation needs n + 1 backtracking
steps to determine that the goal atom is in FFS(Γ). On the other hand, the
S-derivation determines this with two backtracking steps.











Fig. 1. Backtracking by a derivation and an S-derivation

4 An Implementation of EFS Interpreter

In this section, we outline an implementation of an EFS interpreter based on
the S-derivation, and give some results of experiments for typical examples of
EFS's where the S-derivation works efficiently.

In order to construct an efficient interpreter, we adopt two ideas:

1. computing each unifier by using the Aho-Corasick pattern matching
algorithm, and

2. reducing the amount of backtracking in a derivation by using an S-derivation.

4.1 Unifications by the Aho-Corasick Pattern Matching Algorithm 

The Aho-Corasick pattern matching algorithm finds all occurrence positions of 
some patterns by scanning the given text. From a given EFS, the EFS interpreter 
makes a pattern matching machine in advance for all ground strings in the EFS. 
For each given ground goal, the pattern matching machine scans the ground 
term in the given goal clause and outputs all occurrence positions of patterns on 
the term. From the occurrence positions, each unifier is efficiently computed. 

Example 4. Let w = aaabaaabaaabaaa and τ = xbyabz be terms, where a, b ∈ Σ
and x, y, z ∈ X. For the constant substrings b and ab of τ, the pattern matching
machine finds the occurrence positions on w as follows:

b  : (4 : 4), (8 : 8), (12 : 12),
ab : (3 : 4), (7 : 8), (11 : 12),

where (i : j) in the line of b (resp. ab) means that the substring from the ith
to the jth symbol of w is b (resp. ab). For each occurrence (i_b : j_b) of b
and (i_ab : j_ab) of ab such that j_b < i_ab, we obtain unifiers of w and τ as
follows:

{x/(1 : 3), y/(5 : 6), z/(9 : 15)} from ((4 : 4), (7 : 8)),

{x/(1 : 3), y/(5 : 10), z/(13 : 15)} from ((4 : 4), (11 : 12)),

{x/(1 : 7), y/(9 : 10), z/(13 : 15)} from ((8 : 8), (11 : 12)).

A regular EFS is well suited to this computation of unifiers, because every
ground term in each resolvent of the derivation is a substring of the term
in the given initial goal. This implies that all unifiers used in the derivation
can be computed by scanning the given initial goal only once with the pattern
matching machine.
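
The computation in Example 4 can be reproduced with the following Python
sketch. It is our illustration under stated assumptions: a naive scan with
str.find stands in for the Aho-Corasick machine (so it does not have the
single-pass behaviour described above), the function names occurrences and
unifiers_x_b_y_ab_z are hypothetical, and the term xbyabz is hard-coded.
Positions are 1-based and inclusive as in the example, and the test
j_b + 1 < i_ab makes the nonemptiness of y explicit.

```python
def occurrences(pattern, text):
    """Return the 1-based inclusive (start, end) of every occurrence."""
    found = []
    start = text.find(pattern)
    while start != -1:
        found.append((start + 1, start + len(pattern)))
        start = text.find(pattern, start + 1)
    return found


def unifiers_x_b_y_ab_z(w):
    """Unifiers of the ground string w with the term xbyabz, as positions."""
    n = len(w)
    result = []
    for i_b, j_b in occurrences("b", w):
        for i_ab, j_ab in occurrences("ab", w):
            # x, y, and z must each cover at least one symbol of w.
            if i_b > 1 and j_b + 1 < i_ab and j_ab < n:
                result.append({
                    "x": (1, i_b - 1),
                    "y": (j_b + 1, i_ab - 1),
                    "z": (j_ab + 1, n),
                })
    return result


if __name__ == "__main__":
    for unifier in unifiers_x_b_y_ab_z("aaabaaabaaabaaa"):
        print(unifier)
```

Representing each binding as a pair of positions rather than a copied
substring is what allows all unifiers in a derivation to refer back to the
initial text.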



4.2 An Implementation of S-Derivation 

Since an S-derivation deals with the set of all possible unifiers at each step of the 
derivation, it is important to adopt a compact representation of the set. From 
the property of a regular EFS, all terms in the derivation are substrings of the 
term in the given initial goal. Thus, the set of all possible unifiers can be divided 
into some parts as shown by the next example.

Example 5. Let w = aabaabaabaa and τ = xybz be terms, where a, b ∈ Σ and
x, y, z ∈ X. Then, the set of all unifiers can be divided into these three
parts:

U_1: {x/a, y/a, z/aabaabaa},

U_2: {x/a, y/abaa, z/aabaa}, {x/aa, y/baa, z/aabaa},
     {x/aab, y/aa, z/aabaa}, {x/aaba, y/a, z/aabaa},

U_3: {x/a, y/abaabaa, z/aa}, {x/aa, y/baabaa, z/aa},
     {x/aab, y/aabaa, z/aa}, {x/aaba, y/abaa, z/aa},
     {x/aabaa, y/baa, z/aa}, {x/aabaab, y/aa, z/aa},
     {x/aabaaba, y/a, z/aa}.

Furthermore, each U_i (i = 1, 2, 3) is represented as follows:

U_1 = {{x/(1 : 1), y/(2 : 2), z/(4 : 11)}},
U_2 = {{x/(1 : k), y/(k + 1 : 5), z/(7 : 11)} | 4 ≥ k ≥ 1},
U_3 = {{x/(1 : k), y/(k + 1 : 8), z/(10 : 11)} | 7 ≥ k ≥ 1},

where each (i : j) represents the substring from the ith to the jth symbol of
w. For an EFS
Γ = {p(xybz) ← q_1(x), q_2(y), q_3(z); q_1(aa) ←; q_2(baa) ←; q_3(aabaa) ←}
and the set U_2, the S-derivation from ← p(w) is shown in Fig. 2.

Fig. 2. An S-derivation from the goal ← p(w).

We can easily show that the derivation with the divided sets of unifiers is
equivalent to the S-derivation. From the occurrence positions given by the
pattern matching machine, each set U_i is efficiently computed. Thus, an
S-derivation is efficient.
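
The compact representation of Example 5 can be sketched in Python as follows.
This is our illustration, not the interpreter's actual data structure: the
names grouped_unifiers and expand and the dict layout are hypothetical, and the
term xybz is hard-coded. One group is produced per occurrence of the constant
b, and the range of k stands for all unifiers in that group.

```python
def grouped_unifiers(w, sep="b"):
    """Group the unifiers of w with the term xybz by the position of sep.

    Positions are 1-based and inclusive. Each group stores the span shared
    by x and y, the span of z, and the range of the split point k.
    """
    n = len(w)
    groups = []
    # sep needs at least two symbols before it (x and y) and one after (z).
    for p in range(3, n):
        if w[p - 1] == sep:
            groups.append({"xy": (1, p - 1), "z": (p + 1, n), "k": (1, p - 2)})
    return groups


def expand(w, group):
    """List the individual unifiers that a group represents."""
    k_lo, k_hi = group["k"]
    _, xy_end = group["xy"]
    z_start, z_end = group["z"]
    return [
        {"x": w[:k], "y": w[k:xy_end], "z": w[z_start - 1:z_end]}
        for k in range(k_lo, k_hi + 1)
    ]


if __name__ == "__main__":
    w = "aabaabaabaa"
    for group in grouped_unifiers(w):
        print(group, len(expand(w, group)))
```

The three groups correspond to U_1, U_2, and U_3, so twelve unifiers are
stored as three small records.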



4.3 Experimental Results 



We construct three types of EFS interpreters C_1, C_2, and C_3, where C_1, C_2,
and C_3 use a derivation with naive unifications, a derivation with
unifications by the Aho-Corasick algorithm, and an S-derivation with
unifications by the Aho-Corasick algorithm, respectively. We verify the
efficiency of the S-derivation and of unifications using the Aho-Corasick
algorithm by comparing the running times of these interpreters with that of
the definite clause grammar (DCG) provided by the Prolog interpreter.

We consider the following EFS's and DCG's:

Γ_1 = { p_0(x_1 x_2) ← p_1(x_1), p_2(x_2);
        p_1(aaxaa) ← p_1(x);
        p_1(bbxbb) ← p_1(x);
        p_1(aaaa) ←;  p_1(bbbb) ←;
        p_2(a) ←;  p_2(aa) ← },

D_1 = { p_0 → p_1, p_2;
        p_1 → aa p_1 aa;
        p_1 → bb p_1 bb;
        p_1 → aaaa;  p_1 → bbbb;
        p_2 → a;  p_2 → aa },

Γ_2 = { p_0(x_1 x_2 x_3 x_4 aaa) ← p_1(x_1), p_1(x_2), p_1(x_3), p_1(x_4);
        p_0(aaa) ←;
        p_1(axa) ← p_1(x);
        p_1(bxb) ← p_1(x);
        p_1(a) ←;  p_1(b) ←;
        p_1(aa) ←;  p_1(bb) ← },

D_2 = { p_0 → p_1 p_1 p_1 p_1 aaa;
        p_0 → aaa;
        p_1 → a p_1 a;
        p_1 → b p_1 b;
        p_1 → a;  p_1 → b;
        p_1 → aa;  p_1 → bb }.

The DCG D_i and the EFS Γ_i represent the same language (i = 1, 2).



Table 1. The running time for the EFS Γ_1 and the DCG D_1 (sec.)

The length of the text      C_1       C_2      C_3     DCG
100                       18.17     50.46     2.03     0.2
200                       64.05    195.12     3.93     0.4
300                      137.86    435.54     5.86    0.54
400                       238.4    762.86     7.78    0.76
500                      367.11   1181.39     9.62    0.89


The running times of the EFS interpreters for Γ_1 and of the DCG for D_1 are
shown in Table 1. The input data consist of 30 strings from {a, b}. The
results of this experiment show that, if an EFS has successive occurrences of
variables, then an S-derivation is more efficient than the derivation, as
shown by the difference between the running times of C_2 and C_3.

Table 2. The running time for the EFS Γ_2 and the DCG D_2 (sec.)

The length of the text      C_1       C_2       C_3       DCG
5                          8.71      8.75      9.25      6.49
10                        88.42     17.42     20.05     22.43
15                       473.15     52.62     73.24     115.5
20                      1619.83    168.24    351.96     528.6
25                      4200.69     424.1   1175.22   1648.99


In Table 2, we present the running times of each EFS interpreter and of the
DCG, for Γ_2 and D_2. The input data consist of 1000 strings from {a, b}. The
unification by the Aho-Corasick algorithm is efficient, as shown by the
difference between the running times of C_1 and C_2. Furthermore, we find that
C_2 and C_3 are more efficient than the DCG for texts of length 10 or more.
This result indicates that the EFS interpreter backtracks less often than the
DCG.

5 Conclusion 

We have proposed an efficient derivation for EFS’s called S-derivation, where all 
possible unifiers are evaluated at one step of the derivation. We have shown that 
the S-derivation is complete for accepting context-free languages. Furthermore, 
we have implemented the S-derivation, and verified its efficiency by comparing 
with the running time of DCG’s. 

One of the open problems is to discuss computability of the S-derivation for 
the extended classes of regular EFS’s. Since, in the S-derivation, each resolvent 
contains variables even if the initial goal is ground, the unification should be 
efficiently computed for terms containing variables. However, it is known that 
the unification problem for non-regular terms is NP-complete. Therefore, we 
have to consider another approach for the extended class of EFS’s. 

The S-derivation can be applied to translations over strings. We have already
constructed a translator for regular TEFS's [12], which represent binary
relations over context-free languages. It is future work to formalize the
generation of languages by the S-derivation in the framework of TEFS's, and to
design a translator for real data by using our results.

Abstract. In this paper, we perform a worst-case analysis of rule discovery.
A rule is defined as a probabilistic constraint of true assignment to
the class attribute of corresponding examples. In data mining, a rule can 
be considered as representing an important class of discovered patterns. 
We accomplish the aforementioned objective by extending a preliminary 
version of PAC learning, which represents a worst-case analysis for clas- 
sification. Our analysis consists of two cases: the case in which we try to 
avoid finding a bad rule, and the case in which we try to avoid overlook- 
ing a good rule. Discussions on related works are also provided for PAC 
learning, multiple comparison, analysis of association rule discovery, and 
simultaneous reliability evaluation of a discovered rule. 



1 Introduction 

Data mining [2] can be defined as extraction of useful knowledge from massive 
data, and is gaining increasing attention due to advancement of various informa- 
tion technologies. Data mining can be regarded as advanced data analysis, and 
a typical process of analysis consists of several steps [2] . Pattern extraction rep- 
resents an important step in such a process. A rule is defined as a probabilistic 
constraint inherent in a data set, and is widely recognized as representing one 
of the most important patterns in data mining. 

Although rule discovery has been extensively studied in data mining, its 
theoretical analyses are surprisingly rare. Several exceptions include Agrawal et 
al.’s analysis of association rule discovery [1] and our analysis of a discovered 
rule based on simultaneous reliability evaluation [10]. However, these studies 
ignore the total number of rules that can be discovered from a data set. As a 
result, these studies fail to relate the size of a discovery problem 
to the number of examples needed for successful discovery, which suggests that a 
more solid foundation of data mining should be established. 

As a first step toward this objective, we extend a preliminary version of PAC 
learning [7], which represents a worst-case analysis of classification. Our analysis 
consists of two cases: the case in which we try to avoid finding a bad rule, 
and the case in which we try to avoid overlooking a good rule. We also discuss 
related work, including PAC learning [5,7], Jensen and Cohen’s multiple 
comparison [4], Agrawal et al.’s analysis of association rule discovery [1], and our 
previous analysis of a discovered rule based on simultaneous reliability evaluation 
[10]. In the rest of this paper, technical terms and symbols of the referenced papers 
are unified with those of this paper. 

2 Rule Discovery Problem 

2.1 Rule 

Let a data set contain n examples each of which is expressed by b discrete 
attributes and a class attribute. Typically rule discovery assumes no specific 
class attribute unlike classification. However, for the sake of formalization, we 
consider a rule which predicts a specific class attribute to be true. 

Let an assignment A = v of a value v to an attribute A be an atom. In this paper, 
we regard a given data set as a result of sampling with replacement from a true 
data set. We call the probability that an example satisfies a propositional 
logical formula f the true probability $\Pr(f)$ of f. Similarly, an estimated 
probability which is obtained from a given data set for $\Pr(f)$ is represented by 
$\widehat{\Pr}(f)$. Note that $\widehat{\Pr}(f)$ can be calculated by the Laplace estimate or simply 
by the ratio of examples which satisfy f in the data set. We employ the latter 
method in this paper. 

A rule r is represented as follows with a premise y, which is a propositional 
formula of atoms, and a conclusion x, which is a true assignment 
to the class attribute. 

    $r : y \rightarrow x$ 

An intuitive interpretation of r is that many examples satisfy y and those 
examples are likely to satisfy x with high probability. We define $\Pr(y)$ and 
$\Pr(x|y)$ as the generality and the accuracy of r respectively. Similarly, we call 
$\widehat{\Pr}(y)$ and $\widehat{\Pr}(x|y)$ the estimated generality and the estimated accuracy of r 
respectively. 
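
The two estimates can be sketched in a few lines of Python. The sketch assumes a data set stored as a list of (attribute dictionary, class value) pairs and a premise given as a predicate; both representations, and the name `estimate_rule`, are illustrative assumptions rather than part of the formalization above. The estimates are the plain ratios adopted in section 2.1.

```python
from typing import Callable, Mapping, Sequence, Tuple

# Hypothetical representation: (attribute values, truth value of the class attribute).
Example = Tuple[Mapping[str, str], bool]


def estimate_rule(data: Sequence[Example],
                  premise: Callable[[Mapping[str, str]], bool]) -> Tuple[float, float]:
    """Return (estimated generality, estimated accuracy) of the rule premise -> class."""
    covered = [is_true for attrs, is_true in data if premise(attrs)]
    generality = len(covered) / len(data)
    # Accuracy is undefined when no example satisfies the premise;
    # returning 0.0 in that case is an assumed convention.
    accuracy = sum(covered) / len(covered) if covered else 0.0
    return generality, accuracy


data = [
    ({"colour": "red", "size": "big"}, True),
    ({"colour": "red", "size": "small"}, False),
    ({"colour": "blue", "size": "big"}, True),
    ({"colour": "blue", "size": "small"}, False),
]
print(estimate_rule(data, lambda attrs: attrs["colour"] == "red"))  # (0.5, 0.5)
```
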

2.2 Related Classes of Rules 

This section presents several classes of rules which are related to ours. A 
probabilistic if-then rule [9] is defined as follows, where $y_i$ represents a single atom. 

    $y_1 \wedge y_2 \wedge \cdots \wedge y_K \rightarrow x$ 

In [10], a probabilistic if-then rule is called a conjunction rule, and this paper 
follows this paraphrasing. A conjunction rule can be regarded as a special case 
of our rule: the premise is restricted to either a single atom or a conjunction of 
atoms. 

Since a premise of a conjunction rule is represented by a combination of 
atoms, the number $|R|$ of possible conjunction rules is typically huge. The 
following gives $|R|$, where a data set contains b attributes and each of these 
attributes can have one of a values. 

    $|R| = (a+1)^b - 1$    (1) 




Worst-Case Analysis of Rule Discovery 367 



This formula can be explained by the fact that each of b attributes can either 
have one of a values or be excluded from the premise, where the empty premise is 
not counted. A typical value for $|R|$ is huge: for example, $|R|$ = 3,486,784,400 
for a data set of 20 binary attributes. A realistic measure would be to restrict 
the number of atoms allowed in the premise to at most K. The possible number 
$|R_K|$, in this case, is given as follows. 

    $|R_K| = \sum_{i=1}^{K} \binom{b}{i} a^i$    (2) 

Note that (1) can also be derived by setting K = b in (2) and considering the 
binomial coefficients. 
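
The two counts are easy to check numerically. The following sketch (function names are illustrative assumptions) evaluates (1) and (2) with exact integer arithmetic and confirms that (2) with K = b coincides with (1).

```python
from math import comb


def count_rules(a: int, b: int) -> int:
    """Number of conjunction rules from (1): (a + 1)^b - 1."""
    return (a + 1) ** b - 1


def count_rules_at_most_k(a: int, b: int, k: int) -> int:
    """Number of conjunction rules with at most k atoms in the premise, from (2)."""
    return sum(comb(b, i) * a ** i for i in range(1, k + 1))


print(count_rules(2, 20))                # 3486784400
print(count_rules_at_most_k(2, 20, 2))   # 800
print(count_rules_at_most_k(2, 20, 20) == count_rules(2, 20))  # True
```
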

In association rule discovery [1], a data set is restricted to a transactional data 
set which consists of binary attributes. A true assignment to a binary attribute 
is called an item. Let an itemset be either a single item or a conjunction of items. 
An association rule, in its original form, consists of a premise and a conclusion 
each of which is represented by an itemset. In our framework of section 2.1, an 
association rule can be regarded as a special case of our rule: the premise is 
restricted to either a single atom or a conjunction of atoms, and only the value 
“true” is allowed. The cases of $|R|$ and $|R_K|$ for association rule discovery are 
obtained by setting a = 1 in (1) and (2). 

2.3 Discovery Problem 

In this paper, the objective of a user is to obtain, with high probability 
$1 - \delta$, a rule whose generality and accuracy are no smaller than $1 - \zeta$ and 
$1 - \varepsilon$ respectively. Typically multiple rules are obtained in rule discovery, 
but we restrict ourselves to single-rule discovery for the sake of analysis. 

Objective : Find $y \rightarrow x$ which satisfies 

    $\Pr[\Pr(y) \ge 1 - \zeta,\ \Pr(x|y) \ge 1 - \varepsilon] \ge 1 - \delta$    (3) 

    where $\zeta, \varepsilon, \delta > 0$ 

A discovery algorithm to be analyzed obtains a rule whose estimated generality 
and estimated accuracy are no smaller than user-given thresholds $\theta_S$ and 
$\theta_F$ respectively. As stated above, since a given data set is a result of 
sampling from a true data set, the user employs thresholds 
$\theta_S \ne 1 - \zeta$, $\theta_F \ne 1 - \varepsilon$ in applying the algorithm. 

Algorithm : Find $y \rightarrow x$ which satisfies 

    $\widehat{\Pr}(y) \ge \theta_S,\ \widehat{\Pr}(x|y) \ge \theta_F$    (4) 

An interesting problem here is to bound the required number m of examples 
to accomplish (3) under (4). This problem can be named PAGA (Probably 
Approximately General and Accurate) discovery after the well-known PAC 
(Probably Approximately Correct) learning [5,7], and can be regarded as a 
foundation of data mining. 

3 Case 1: Exclusion of a Bad Rule 

In this section, we derive a lower bound of the number of examples for the 
problem defined in the previous section. An assumed condition is to avoid finding 
a bad rule. This condition can be considered as important in several domains 
where reliability represents a crucial concern. 



3.1 Preliminaries 

First we introduce preliminaries which are needed in subsequent analyses. If 
the domain of a random variable X is {0, 1, ..., m} and the probability 
distribution of the variable is represented as follows, X is said to follow a 
binomial distribution [3]. 

    $\Pr(X = k) = B(k; m, p) = \binom{m}{k} p^k (1-p)^{m-k}$    (5) 

where p represents a constant 0 < p < 1 and k = 0, 1, ..., m. The Chernoff 
bound states that the following holds for an arbitrary constant a > p [1]. 

    $\Pr(X \ge am) \le \exp[-2m(a - p)^2]$    (6) 
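
As a quick numerical illustration of (6), the sketch below compares the exact upper tail of the binomial distribution (5) with the exponential bound. The values m = 100, p = 0.3 and a = 0.4 are arbitrary choices made for this example, and the function names are assumptions.

```python
from math import comb, exp


def binomial_upper_tail(m: int, p: float, k_min: int) -> float:
    """Exact Pr(X >= k_min) for X following B(k; m, p) in (5)."""
    return sum(comb(m, k) * p ** k * (1 - p) ** (m - k) for k in range(k_min, m + 1))


def chernoff_bound(m: int, p: float, a: float) -> float:
    """Right-hand side of (6); meaningful for a > p."""
    return exp(-2 * m * (a - p) ** 2)


m, p, k_min = 100, 0.3, 40  # illustrative values only
exact = binomial_upper_tail(m, p, k_min)
bound = chernoff_bound(m, p, k_min / m)
print(exact, bound, exact <= bound)
```

The bound is loose for such a small m, which is consistent with the worst-case character of the analysis that follows.
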



3.2 Theoretical Analysis 

From (3), a bad rule $r_b : y \rightarrow x$ satisfies 

    $\Pr(y) < 1 - \zeta$ or $\Pr(x|y) < 1 - \varepsilon$.    (7) 

Since we assume, in this section, that we avoid finding a bad rule, the employed 
thresholds for generality and accuracy are relatively large. This assumption 
together with (3) and (4) necessitates the following. 

    $\theta_S > 1 - \zeta$ and $\theta_F > 1 - \varepsilon$    (8) 

From (7) and (8), 

    $\theta_S > \Pr(y)$ or $\theta_F > \Pr(x|y)$.    (9) 

Since $r_b : y \rightarrow x$ is discovered, 

    $\widehat{\Pr}(y) \ge \theta_S$ and $\widehat{\Pr}(x|y) \ge \theta_F$.    (10) 



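Let the number of examples in the given data set be m. The rule $r_b$ happens to 
be discovered if and only if y and xy are satisfied by at least 
$\lceil m\theta_S \rceil$ and $\lceil m\widehat{\Pr}(y)\theta_F \rceil$ examples respectively in the data 
set. Since each of the numbers of examples which satisfy y and xy follows a 
binomial distribution, 

    $\Pr(r_b \text{ discovered})$ 

    $\le \mathrm{MAX}\left\{ \sum_{k=\lceil m\theta_S \rceil}^{m} B(k; m, \Pr(y)),\ \sum_{k=\lceil m\widehat{\Pr}(y)\theta_F \rceil}^{m\widehat{\Pr}(y)} B(k; m\widehat{\Pr}(y), \Pr(x|y)) \right\}$    (11) 

    $\le \mathrm{MAX}\left\{ \exp\left[-2m\left(\frac{\lceil m\theta_S \rceil}{m} - \Pr(y)\right)^2\right],\ \exp\left[-2m\widehat{\Pr}(y)\left(\frac{\lceil m\widehat{\Pr}(y)\theta_F \rceil}{m\widehat{\Pr}(y)} - \Pr(x|y)\right)^2\right] \right\}$    (12) 

    $\le \mathrm{MAX}\left\{ \exp\left[-2m(\theta_S - 1 + \zeta)^2\right],\ \exp\left[-2m\theta_S(\theta_F - 1 + \varepsilon)^2\right] \right\}.$    (13) 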



Note that, in (11), we consider separately the case in which a bad rule $r_{b1}$ in 
terms of generality is discovered and the case in which a bad rule $r_{b2}$ in terms 
of accuracy is discovered. The first and second terms correspond to the left 
inequality and the right inequality of (7) respectively. Since $\Pr(r_{b1})$ and 
$\Pr(r_{b2})$ are unknown, we bound $\Pr(r_b \text{ discovered})$ by 
$\mathrm{MAX}[\Pr(r_{b1} \text{ discovered}), \Pr(r_{b2} \text{ discovered})]$. In (12), the Chernoff 
bound (6) is employed from (9). Finally in (13), we employ (7) and the left 
inequality of (10). 

Let the set of all possible rules and the set of all bad rules be $R$ and $R_b$ 
respectively, and let the cardinality of a set S be |S|. The probability of 
discovering a bad rule satisfies the following inequalities. 



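    $\Pr(R_b \text{ contains a discovered rule})$ 

    $\le |R_b|\,\mathrm{MAX}\left\{ \exp\left[-2m(\theta_S - 1 + \zeta)^2\right],\ \exp\left[-2m\theta_S(\theta_F - 1 + \varepsilon)^2\right] \right\}$    (14) 

    $\le |R|\,\mathrm{MAX}\left\{ \exp\left[-2m(\theta_S - 1 + \zeta)^2\right],\ \exp\left[-2m\theta_S(\theta_F - 1 + \varepsilon)^2\right] \right\}$    (15) 

Note that, in (14), we allow the cases in which several bad rules satisfy the 
discovery condition to be counted multiple times, and (15) uses $|R| \ge |R_b|$. 
Our objective (3) requires the following with respect to a sufficiently small $\delta$. 

    $|R|\,\mathrm{MAX}\left\{ \exp\left[-2m(\theta_S - 1 + \zeta)^2\right],\ \exp\left[-2m\theta_S(\theta_F - 1 + \varepsilon)^2\right] \right\} \le \delta$    (16) 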



We obtain a lower bound of the number m of examples for discovery in which 
finding a bad rule is avoided with a high probability. 



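    $m \ge \dfrac{\ln\left(\frac{|R|}{\delta}\right)}{2\,\mathrm{MIN}\left[(\theta_S - 1 + \zeta)^2,\ \theta_S(\theta_F - 1 + \varepsilon)^2\right]}$    (17) 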
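
The bound (17) can be evaluated directly. The sketch below is one possible implementation under the assumption that both margins are positive, as required by (8); the function name and the parameter values in the usage example are illustrative. The logarithm of $|R|/\delta$ is computed as a sum so that very large integer values of $|R|$ do not overflow.

```python
from math import log


def min_examples_avoiding_bad_rule(num_rules: int, delta: float,
                                   theta_s: float, theta_f: float,
                                   zeta: float, eps: float) -> float:
    """Right-hand side of (17)."""
    margin_generality = theta_s - 1 + zeta
    margin_accuracy = theta_f - 1 + eps
    if margin_generality <= 0 or margin_accuracy <= 0:
        raise ValueError("(8) requires theta_s > 1 - zeta and theta_f > 1 - eps")
    denominator = 2 * min(margin_generality ** 2, theta_s * margin_accuracy ** 2)
    # ln(|R| / delta) written as a sum; math.log accepts arbitrarily large ints.
    return (log(num_rules) + log(1 / delta)) / denominator


# Illustrative setting: 100 ternary choices per attribute, margins of about 0.01.
print(min_examples_avoiding_bad_rule(3 ** 100 - 1, 0.05,
                                     theta_s=0.1, theta_f=0.9,
                                     zeta=0.91, eps=0.11))
```

With these assumed values the result is in the order of several million examples, which illustrates how strongly small margins inflate the bound.
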



The above inequality describes the influence of each parameter on the minimum 
number of examples quantitatively. As we have seen in section 2.2, $|R|$ is typically 
large and is thus important even though its influence is moderated by a logarithmic 
function. The second most important factors are $\theta_S - 1 + \zeta$ and 
$\theta_F - 1 + \varepsilon$. Since they influence the lower bound of m by the inverse of 
their squares, they can be problematic when they are small. Since each of these 
terms represents the difference between a threshold and the user-expected value, 
$\theta_S - 1 + \zeta$ and $\theta_F - 1 + \varepsilon$ can be named the margin of generality and 
the margin of accuracy respectively. In a typical setting of rule discovery, we can 
assume $\theta_S = 0.1$, and we assume that $(\theta_S - 1 + \zeta) = 10^{-1}$ or $10^{-2}$. We also 
assume that $(\theta_F - 1 + \varepsilon) = 10^{-1}$ or $10^{-2}$. Under these assumptions, the 
denominator is either $2 \cdot 10^{-3}$ or $2 \cdot 10^{-5}$. Finally, $\delta$ can be considered 
a moderately important factor in a typical situation $\delta$ = 0.01 - 0.05 since it 
appears only as a denominator of $|R|$ inside the logarithm. 



3.3 Application to Conjunction Rule Discovery 



From (1) and (17), the lower bound of the number m of examples is given as 
follows if we restrict the discovered rule to a conjunction rule. 



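    $m \ge \dfrac{\ln\left[(a+1)^b - 1\right] + \ln\left(\frac{1}{\delta}\right)}{2\,\mathrm{MIN}\left[(\theta_S - 1 + \zeta)^2,\ \theta_S(\theta_F - 1 + \varepsilon)^2\right]}$    (18) 

Note that setting a = 1 gives the case of association rule discovery. 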

Firstly, $\ln(1/\delta)$ can typically be ignored when $\delta$ = 0.01 - 0.05 because 
$\ln[(a+1)^b - 1] \gg \ln(1/\delta)$; thus the lower bound of m is approximately 
proportional to b. Secondly, since the number a of possible values for an attribute 
only affects the right-hand side through a logarithmic function, a is typically not 
so important as b and the margins of generality and accuracy. We show, in figure 1, 
a plot of the lower bound against 
$\mathrm{MIN}[(\theta_S - 1 + \zeta)^2, \theta_S(\theta_F - 1 + \varepsilon)^2]$ for $b = 10^2, 10^3, 10^4$, 
where we set a = 2 and $\delta = 0.05$. Note that each of the x axis and the y axis 
is represented by a logarithmic scale. 



[Figure 1: plot of the lower bound for m (y axis) against MIN (x axis)] 

Fig. 1. Minimum number of examples needed for conjunction rule discovery without 
finding a bad rule. In the figure, MIN represents 
$\mathrm{MIN}[(\theta_S - 1 + \zeta)^2, \theta_S(\theta_F - 1 + \varepsilon)^2]$. 

We discuss the lower bound of the number of examples for a typical 
setting with figure 1. The examples described in section 3.2 give 
$\mathrm{MIN}[(\theta_S - 1 + \zeta)^2, \theta_S(\theta_F - 1 + \varepsilon)^2] = 10^{-3}$ or $10^{-5}$. For these 
cases, the lower bound is approximately $5.6 \cdot 10^4$ - $5.6 \cdot 10^6$ or 
$5.6 \cdot 10^6$ - $5.6 \cdot 10^8$ for $b = 10^2$ - $10^4$. These 
results indicate that the required number of examples for successful discovery can 
be prohibitively large for small margins. Note that large margins represent large 
thresholds, and usually no rules are discovered for large thresholds. A realistic 
and effective measure for this problem would be to adjust thresholds according 
to a discovery process such as [11]. It should be noted, in any case, that our 
analyses in this paper correspond to the worst case, and the required number of 
examples in a real discovery problem can be much smaller than those mentioned above. 

From (2) and (17), the lower bound of the number m of examples is given as 
follows if we restrict the discovered rule to a conjunction rule with at most K 
atoms in its premise. 



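    $m \ge \dfrac{\ln\left[\sum_{i=1}^{K} \binom{b}{i} a^i\right] + \ln\left(\frac{1}{\delta}\right)}{2\,\mathrm{MIN}\left[(\theta_S - 1 + \zeta)^2,\ \theta_S(\theta_F - 1 + \varepsilon)^2\right]}$    (19) 

Note that setting a = 1 gives the case of association rule discovery. 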

As in figure 1, we show, in figure 2, two plots of the lower 
bound for a = 2 and $\delta = 0.05$. The left plot represents a case in which we 
varied $b = 10^2, 10^3, 10^4$ under K = 2, and in the right plot we varied K = 
1, 2, 3, 4, 100 (= b) under $b = 10^2$. 



[Figure 2: two plots of the lower bound for m; left panel varies b, right panel varies K] 

Fig. 2. Minimum number of examples needed for conjunction rule discovery without 
finding a bad rule, where at most K atoms are allowed in the premise. The left and 
right plots assume K = 2 and b = 100 respectively. 



From the left plot, we see that the influence of b is relatively small for K = 2. 
On the other hand, the right plot of figure 2 shows that, for $K \le 4$, the minimum 
required number of examples is smaller by approximately an order of magnitude 
than in the case of considering all conjunction rules (K = b = 100). It is widely 
accepted that a rule with a short premise exhibits high readability, and the above 
results suggest that such rules are also attractive in terms of the required number 
of examples. 
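
The effect of restricting the premise length can be reproduced with a short calculation. The following sketch evaluates (18) and (19) for a = 2, b = 100 and $\delta = 0.05$; the value $10^{-3}$ used for the MIN term and the function names are assumptions made for this illustration.

```python
from math import comb, log
from typing import Optional


def num_conjunction_rules(a: int, b: int, k: Optional[int] = None) -> int:
    """|R| from (1) when k is None, otherwise |R_K| from (2)."""
    if k is None:
        return (a + 1) ** b - 1
    return sum(comb(b, i) * a ** i for i in range(1, k + 1))


def lower_bound(a: int, b: int, delta: float, min_term: float,
                k: Optional[int] = None) -> float:
    """Right-hand side of (18) (k is None) or (19), given the value of the MIN term."""
    return (log(num_conjunction_rules(a, b, k)) + log(1 / delta)) / (2 * min_term)


for k in (1, 2, 3, 4, None):  # None stands for K = b, i.e. all conjunction rules
    print(k, round(lower_bound(2, 100, 0.05, 1e-3, k)))
```

Because the number of rules enters only through its logarithm, moving from K = 4 to all conjunction rules changes the bound by a factor of roughly five in this setting rather than by many orders of magnitude.
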



4 Case 2: Inclusion of a Good Rule 



In this section, we derive another lower bound of the number of examples for 
the problem defined in section 2.3. An assumed condition is to avoid overlooking 
a good rule. This condition can be considered as important in several domains 
where possibility is considered as highly important. 

From (3), a good rule $r_g : y \rightarrow x$ satisfies 

    $\Pr(y) \ge 1 - \zeta$ and $\Pr(x|y) \ge 1 - \varepsilon$.    (20) 

Since we assume, in this section, that we avoid overlooking a good rule, the 
employed thresholds for generality and accuracy are relatively small. This 
assumption together with (3) and (4) necessitates the following. 

    $\theta_S < 1 - \zeta$ and $\theta_F < 1 - \varepsilon$    (21) 

From (20) and (21), 

    $\theta_S < \Pr(y)$ and $\theta_F < \Pr(x|y)$.    (22) 

Let the number of examples in the given data set be m. The rule $r_g$ happens to 
be undiscovered if and only if y is satisfied by at most $\lceil m\theta_S \rceil - 1$ 
examples or xy is satisfied by at most $\lceil m\widehat{\Pr}(y)\theta_F \rceil - 1$ examples in the 
data set. Since each of the numbers of examples which satisfy y and xy follows a 
binomial distribution, 



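    $\Pr(r_g \text{ undiscovered})$ 

    $= \Pr\left[ (y \text{ satisfied by at most } \lceil m\theta_S \rceil - 1 \text{ examples}) \cup (xy \text{ satisfied by at most } \lceil m\widehat{\Pr}(y)\theta_F \rceil - 1 \text{ examples}) \right]$    (23) 

    $\le \sum_{k=0}^{\lceil m\theta_S \rceil - 1} B(k; m, \Pr(y)) + \sum_{k=0}^{\lceil m\widehat{\Pr}(y)\theta_F \rceil - 1} B(k; m\widehat{\Pr}(y), \Pr(x|y))$    (24) 

    $= \sum_{k=m - \lceil m\theta_S \rceil + 1}^{m} B(k; m, 1 - \Pr(y)) + \sum_{k=m\widehat{\Pr}(y) - \lceil m\widehat{\Pr}(y)\theta_F \rceil + 1}^{m\widehat{\Pr}(y)} B(k; m\widehat{\Pr}(y), 1 - \Pr(x|y))$    (25) 

    $\le \exp\left[-2m\left(\frac{m - \lceil m\theta_S \rceil + 1}{m} - 1 + \Pr(y)\right)^2\right] \cdot \exp\left[-2m\widehat{\Pr}(y)\left(\frac{m\widehat{\Pr}(y) - \lceil m\widehat{\Pr}(y)\theta_F \rceil + 1}{m\widehat{\Pr}(y)} - 1 + \Pr(x|y)\right)^2\right]$    (26) 

    $\le \exp\left\{-2m\left[(-\theta_S + 1 - \zeta)^2 + \theta_S(-\theta_F + 1 - \varepsilon)^2\right]\right\}.$    (27) 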



Note that we consider separately the cases in which the generality and the 
accuracy of a good rule are below the respective thresholds in (23). In (24), we 
allow the probability of the simultaneous occurrence of these cases to be 
double-counted, and assume that $r_g$ is undiscovered due to apparently low accuracy 
in the second term. Note that (25) corresponds to replacement of p by 1 - p in (5). 
In (26), the Chernoff bound (6) is employed from (22). Finally in (27), we employ 
(20) and $\widehat{\Pr}(y) \ge \theta_S$. Note that the last inequality holds in the second 
term since $r_g$ is undiscovered due to apparently low accuracy. 

Similarly to section 3.2, the following can be obtained as a lower bound of the 
number m of examples for discovery in which overlooking a good rule is avoided 
with a high probability. 



    $m \ge \dfrac{\ln\left(\frac{|R|}{\delta}\right)}{2\left[(-\theta_S + 1 - \zeta)^2 + \theta_S(-\theta_F + 1 - \varepsilon)^2\right]}$    (28) 



Note that (28) can be obtained by substituting, in the denominator of (17), the 
sum of the two terms $(-\theta_S + 1 - \zeta)^2$ and $\theta_S(-\theta_F + 1 - \varepsilon)^2$ for the 
minimum of the corresponding two terms. In section 3, we had to bound 
$\Pr(r_b \text{ discovered})$ by $\mathrm{MAX}[\Pr(r_{b1} \text{ discovered}), \Pr(r_{b2} \text{ discovered})]$ since 
$\Pr(r_{b1})$ and $\Pr(r_{b2})$ are unknown. In this section, on the other hand, we can 
directly calculate Pr[($r_g$ undiscovered due to apparently low generality) $\cup$ 
($r_g$ undiscovered due to apparently low accuracy)]. The substitution is due to 
this difference. 

Under this condition, discussions similar to those in sections 3.2 and 3.3 hold for 
(28). Note that large margins ($1 - \zeta - \theta_S$ and $1 - \varepsilon - \theta_F$ in this case) 
represent small thresholds, and small thresholds typically result in a large number 
of candidates of the discovered rule to be inspected. The automatic adjustment 
of thresholds [11] can also be a realistic measure for this problem. 
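
To make the relation between the two cases concrete, the sketch below evaluates the Case 1 bound (17) and the Case 2 bound, in which the sum of the two margin terms replaces their minimum, for margins of equal magnitude. The function names, the use of margin magnitudes as parameters, and the numerical values are assumptions made for illustration.

```python
from math import log


def bound_case1(log_num_rules: float, delta: float, theta_s: float,
                margin_generality: float, margin_accuracy: float) -> float:
    """Bound of the form (17): the minimum of the two margin terms in the denominator."""
    terms = (margin_generality ** 2, theta_s * margin_accuracy ** 2)
    return (log_num_rules + log(1 / delta)) / (2 * min(terms))


def bound_case2(log_num_rules: float, delta: float, theta_s: float,
                margin_generality: float, margin_accuracy: float) -> float:
    """Bound of the form (28): the sum of the two margin terms in the denominator."""
    terms = (margin_generality ** 2, theta_s * margin_accuracy ** 2)
    return (log_num_rules + log(1 / delta)) / (2 * sum(terms))


log_r = 100 * log(3)  # ln|R| for a = 2, b = 100, neglecting the -1 in (1)
print(bound_case1(log_r, 0.05, 0.1, 0.1, 0.1))
print(bound_case2(log_r, 0.05, 0.1, 0.1, 0.1))
```

Since a sum of non-negative terms is never smaller than their minimum, the second value cannot exceed the first when the margin magnitudes coincide.
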



5 Discussions on Related Topics 

5.1 PAC Learning 

PAC learning represents a worst-case analysis for classification, and has produced 
numerous excellent results. Our results in section 3.2 can be considered as an 
extension of a preliminary version of PAC learning [7]. First, a classifier ignores 
generality since it predicts the class attribute for all examples. This is the reason 
why $\Pr(y)$, $\widehat{\Pr}(y)$, and $\theta_S$ are not considered in [7]. Actually, the objective 
of learning in [7] is represented as a simplification of (3) as follows, where h 
represents a classifier. 

Objective : Find h which satisfies 

    $\Pr[\Pr(h \text{ correctly predicts } x) \ge 1 - \varepsilon] \ge 1 - \delta$    (29) 

    where $\varepsilon, \delta > 0$ 

Next, [7] assumes that a classification algorithm returns a classifier which is 
consistent with all training examples. This corresponds to assuming $\theta_F = 1$. 

To sum up, compared to our study, [7] ignores the case of learning a classifier 
with low generality and the case of learning a classifier which is inconsistent with 
the training examples. In this case, application of the Chernoff bound can be 
skipped, and for a bad classifier $h_b$, we obtain $\Pr(h_b \text{ learned}) = (1 - \varepsilon)^m$. 
In [7], a lower bound of the required number of examples m is given by the 
following, where H represents a set of all classifiers. 

    $m \ge \dfrac{\ln\left(\frac{|H|}{\delta}\right)}{\varepsilon}$    (30) 

Note that (30) resembles (17): it only ignores generality ($\theta_S = 1$ and no $\zeta$), 
assumes $\theta_F = 1$, and omits the square and the factor 2 in the denominator. The 
last omissions are due to skipping the Chernoff bound. 
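
A small numerical check of this bound is sketched below, using the standard form $m \ge \ln(|H|/\delta)/\varepsilon$ for consistent classifiers. The hypothesis-space size of 1000 and the function name are assumptions chosen for the example.

```python
from math import ceil, log


def pac_sample_size(num_classifiers: int, eps: float, delta: float) -> float:
    """ln(|H| / delta) / eps, the sample size for consistent classifiers."""
    return (log(num_classifiers) + log(1 / delta)) / eps


num_classifiers, eps, delta = 1000, 0.1, 0.05  # illustrative values only
m = ceil(pac_sample_size(num_classifiers, eps, delta))
# Counting every classifier as bad, the chance that some bad classifier
# survives m examples is at most |H| (1 - eps)^m.
failure = num_classifiers * (1 - eps) ** m
print(m, failure, failure <= delta)
```
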



5.2 Jensen and Cohen’s Multiple Comparison 

Jensen and Cohen’s multiple comparison [4] proposes a prudent view of classifi- 
cation. Its essential point can be stated as a probabilistic explanation that the 
more candidates of classifiers are inspected in a learning algorithm, the smaller 
accuracy is exhibited by the obtained classifier. The multiple comparison pro- 
vides a comprehensive unified view of several studies including overfitting [8] 
and oversearching [6], and [4] also proposes several realistic measures. 

Since this study deals with classification, as PAC learning does, it ignores 
generality. This corresponds to considering only the second term in (11). Since [4] 
considers the case of $\theta_F < 1$, it provides a more realistic framework for 
learning than [7]. The multiple comparison differs from our study in that it directly 
calculates, based on a binomial distribution without using the Chernoff bound, 
the probability for a bad classifier to satisfy at least $\lceil m\theta_F \rceil$ examples. 
Moreover, they calculate exactly the probability that no bad classifier is learned 
while we, in (14), allow counting multiple times the cases in which more than one 
bad rule satisfies the discovery condition. Let the set of all bad classifiers be 
$H_b$; the probability in [4] is given by the following. 

    $\Pr(H_b \text{ contains a learned classifier}) = 1 - [1 - \Pr(h_b \text{ learned})]^{|H_b|}$    (31) 

Pursuing strictness in calculation can be considered a double-edged sword. 
Jensen and Cohen give no analytical solutions for the required number of 
examples for successful learning. We attribute this to the fact that solving 
(31) for m is relatively difficult. We have employed several approximations in our 
theoretical analyses, and these were necessary to bound m analytically. Another 
difference between [4] and our analyses is rather philosophical: while they are 
pessimistic about classification, we are realistic about rule discovery. The study 
in [4] emphasizes that $|R|$ is huge, and demonstrates various examples in which 
it is difficult to avoid learning a bad classifier. We also recognize that $|R|$ is huge, 
but bound the required number of examples m analytically with respect to $|R|$. 
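
The difference between the exact calculation and the multiple-counting approximation can be seen numerically. The sketch below assumes, as the exact expression does, that each bad classifier is learned independently with the same probability; the function names and values are illustrative.

```python
def exact_probability(p_single: float, num_bad: int) -> float:
    """1 - (1 - p)^n: probability that at least one of n bad classifiers is learned."""
    return 1 - (1 - p_single) ** num_bad


def multiple_counting_bound(p_single: float, num_bad: int) -> float:
    """n * p, capped at 1: the bound obtained when overlapping cases are counted repeatedly."""
    return min(1.0, num_bad * p_single)


p_single, num_bad = 1e-4, 1000  # illustrative values only
print(exact_probability(p_single, num_bad), multiple_counting_bound(p_single, num_bad))
```

The approximation is slightly pessimistic, but it is linear in the number of bad classifiers, which is what makes an analytical solution for m possible.
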



5.3 Theoretical Analysis of Association Rule Discovery 

Analyses of association rule discovery [1] are threefold: a lower bound of the 
number of queries under the use of a database system, the expected number of 
itemsets each of which is satisfied by at least a required number of examples 
in a random data set, and the number of examples satisfied by an itemset in a 
sampled data set. The third analysis is highly related to our study in that both 
of the two deal with the case of sampling m examples from a true data set in 
rule discovery. 

The analysis provides a specialization of the Chernoff bound (6), where X 
is regarded as $m\widehat{\Pr}(I)$ for an itemset I. It first regards the right-hand side 
$\exp[-2m(a - p)^2]$ as the upper bound of the probability for $\widehat{\Pr}(I)$ to deviate at 
least a - p from its value p (= $\Pr(I)$) in the true data set. Next, it gives several 
examples of values for a - p and $\delta$ in $\exp[-2m(a - p)^2] = \delta$, and presents the 
corresponding values of m in a table. 
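
Solving $\exp[-2m(a - p)^2] = \delta$ for m gives $m = \ln(1/\delta) / (2(a - p)^2)$. The sketch below tabulates this value for a few deviations and confidence levels; the chosen values and the function name are assumptions and are not taken from the table in [1].

```python
from math import exp, log


def sample_size(deviation: float, delta: float) -> float:
    """m such that exp(-2 m deviation^2) equals delta."""
    return log(1 / delta) / (2 * deviation ** 2)


for deviation in (0.01, 0.001):
    for delta in (0.05, 0.01):
        m = sample_size(deviation, delta)
        # Substituting m back should recover delta up to rounding error.
        print(deviation, delta, round(m), exp(-2 * m * deviation ** 2))
```
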

The discovery algorithm employed in [1] first obtains, by an algorithm called 
Apriori, a set of itemsets I each of which satisfies $\widehat{\Pr}(I) \ge \theta_S$. Then, it 
generates a set of association rules from this set. One of the motivations of the 
above analysis was to reduce the run-time of Apriori by the use of a sampled data 
set. Due to this motivation, [1] ignores accuracy unlike our study. Moreover, since 
it considers a single association rule, the study fails to relate the size of a 
discovery problem to the number of examples needed for successful discovery. 



5.4 Simultaneous Reliability Evaluation of a Discovered Rule 

Simultaneous reliability evaluation of a discovered rule [10] also deals with the 
case of sampling m examples from a true data set in rule discovery as in section 
5.3 and our study. Unlike the analysis in section 5.3, this study considers both 
generality and accuracy. 

The objective considered in [10] is identical to ours, and is represented by (3). 
Let $\bar{x}$ represent the negation of x. The analysis fixes m and employs neither 
$\theta_S$ nor $\theta_F$. It assumes that $(m\widehat{\Pr}(xy), m\widehat{\Pr}(\bar{x}y))$ follows a 
two-dimensional normal distribution, and obtains the exact condition for 
accomplishing the objective analytically. This is a different framework from ours: 
we use a discovery algorithm with fixed thresholds $\theta_S$, $\theta_F$ in (4) and bound 
the number m of sampled examples. The problem dealt with in [10] can be reduced to 
the problem of deriving 




376 E. Suzuki 



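and analyzing two tangent lines of an ellipse, and applying Lagrange’s multiplier 
method gives the following analytical solutions. 

    $\left(1 - \beta(\delta)\sqrt{\dfrac{1 - \widehat{\Pr}(y)}{n\,\widehat{\Pr}(y)}}\right)\widehat{\Pr}(y) \ge 1 - \zeta$    (32) 

    $\left(1 - \beta(\delta)\sqrt{\dfrac{\widehat{\Pr}(\bar{x}, y)}{\widehat{\Pr}(x, y)\left\{(n + \beta(\delta)^2)\widehat{\Pr}(y) - \beta(\delta)^2\right\}}}\right)\widehat{\Pr}(x|y) \ge 1 - \varepsilon$    (33) 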



Here $\beta(\delta)$ represents a positive constant which defines the size of a 
$1 - \delta$ confidence region, i.e. the ellipse for $(m\widehat{\Pr}(xy), m\widehat{\Pr}(\bar{x}y))$, and 
can be obtained by a simple numerical integration. Note that (32) and (33) represent 
conditions for generality and accuracy respectively. Each of them states that the 
corresponding estimated probability multiplied by a coefficient which is related to 
the size of the confidence region is no smaller than the corresponding 
user-expected value ($1 - \zeta$ or $1 - \varepsilon$). 

Since the study [10] assumes a specific distribution to the simultaneous occur- 
rence of random variables, it does not fall in the category of worst-case analysis. 
Similarly to the analysis in section 5.3, the study fails to relate the size of a 
discovery problem to the number of examples needed for successful discovery. 



6 Conclusions 

The main contribution of this paper is threefold. 1) We formalized a worst-case 
analysis of rule discovery. The proposed framework employs thresholds $\theta_S$, 
$\theta_F$ for generality and accuracy which are different from user-expected values 
$1 - \zeta$, $1 - \varepsilon$ respectively. We considered the case in which we try to avoid 
finding a bad rule, and the case in which we try to avoid overlooking a good rule. 
2) We derived a lower bound of the number m of required examples. By using 
probabilistic formalization and appropriate approximations, two lower bounds 
are obtained for the aforementioned two cases. Quantitative analysis of one of the 
lower bounds revealed that the total number $|R|$ of rules, the margin 
$\theta_S - 1 + \zeta$ for generality, and the margin $\theta_F - 1 + \varepsilon$ for accuracy are 
important. 3) We analyzed one of the lower bounds for a set of specific problems of 
conjunction rule discovery. Various useful conclusions are obtained by inspecting 
the lower bound for a set of typical settings. 

Contribution 1) means that this paper has provided, in rule discovery, a 
framework which corresponds to PAC learning. This framework can 
be named PAGA (Probably Approximately General and Accurate) discovery. 
PAGA discovery can be regarded as a promising theoretical foundation of 
active mining, which requests new examples in a discovery process. 
Contributions 2) and 3) suggest various useful policies for applying rule 
discovery algorithms in practice. Such policies include sampling/extension of a 
data set and modification of the class of discovered rules. We can safely conclude 
that our comprehension of rule discovery has deepened with these contributions and 
the discussions in section 5. Ongoing work focuses on analyses of more realistic 
algorithms, especially an algorithm which discovers multiple rules with various 
conclusions. 

Abstract. A new data model for filtering semi-structured texts is presented. 
Given positive and negative examples of HTML pages labeled 
by a labelling function, the HTML pages are divided into a set of paths 
using the XML parser. A path is a sequence of element nodes and text 
nodes such that a text node appears only at the tail of the path. The 
labels of an element node and a text node are called a tag and a text, 
respectively. The goal of a mining algorithm is to find an interesting 
pattern, called an association path, which is a pair of a tag-sequence t and 
a word-sequence w represented by the word-association pattern [1]. An 
association path (t, w) agrees with a labelling function on a path p if t is 
a subsequence of the tag-sequence of p and w matches with the text of p 
iff p is in a positive example. The importance of such an association path 
$\alpha$ is measured by the agreement of a labelling function on given data, 
i.e., the number of paths on which $\alpha$ agrees with the labelling function. 
We present a mining algorithm for this problem and show the efficiency 
of this model by experiments. 



1 Introduction 

In information extraction, detecting the occurrences or the regions of useful 
texts is one of the central problems in Web mining. In the case of Web data, 
this problem is particularly difficult because we cannot represent a rich logical 
structure by the limited tags of HTML. The framework of wrapper induction 
introduced by Kushmerick [13] is a new approach to handle this difficulty. The 
most interesting result of his study is to show the effectiveness and efficiency 
of simple wrappers with string delimiters in information extraction tasks. 
In addition to his work, other extraction models can be found, for example, in [8, 
9,10,11,15,17]. 

The target class, called HTML pages, of the wrapper induction model is 
restricted such that a page is defined by a finite repetition of a sequence of 



K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 378—388, 2001. 
@ Springer- Verlag Berlin Heidelberg 2001 




Mining Semi-structured Data by Path Expressions 379 



attributes. The attributes are the data which an algorithm has to extract. In a 
learning model, a learning algorithm takes an input of labeled examples such 
that the labels indicate whether they are positive data or negative data. The 
strategy is useful to learn a concept for the wrapper class. 

However, in case that a concept class is hard to learn by a small number of 
examples, the model may not be effective. This difficulty is critical in the point 
of implementation since the labelling examples are actually made by human 
inspection. Thus, we would like to present a mining model to decide which 
portion of a given data is important and an automatic process to construct a 
large labelling sample. 

The aim of this paper is to find rules for filtering semi-structured texts 
according to users’ interests. An HTML/XML file can be considered as an ordered 
labeled tree. We assume that each node is either an element node or a text 
node. Each node has two types of labels called the name and the value. An element 
node corresponds to a tag. The name of the node is the tag name like <HTML>, 
<br>, and <a>, and the value of the node is empty. A text node corresponds to a 
portion of a plain text in an HTML file, and its name is the reserved string #Text. 
The value of a text node is the text. 

A filtering rule is a sequence $s = (\alpha_1, \ldots, \alpha_k, \beta)$, where each $\alpha_i$ is a 
tag name and $\beta$ is a word-association pattern [1], which is a string consisting of 
several words and the wild card *. A word-association pattern matches with a string 
if there is a possible substitution for all *. Given s and a semi-structured text, 
using an XML parser, we can easily construct the tree structure and decompose the 
tree into the set P of paths. Each path contains at most one text node in the tail. 
The semantics of the filtering rule s for P is defined as follows. For each 
$p \in P$, s matches with p if $\alpha_1, \ldots, \alpha_k$ is a subsequence of the sequence of 
tag names of p and the tail of p is a text node such that $\beta$ matches with the 
value of the node. 

Such a filtering rule is considered as a simple decision tree to extract texts 
from paths in HTML trees. Each $\alpha_i$ represents a test on a node. Unless the test 
fails, we continue to the next test $\alpha_{i+1}$. Finally, the value of the text 
node is extracted according to the pattern $\beta$. In other words, this rule is a pair 
$(\alpha, \beta)$ of a tag pattern and an association pattern, where a tag pattern is a 
sequence $\alpha = (\alpha_1, \ldots, \alpha_k)$ of tag names such that these tags frequently appear 
in positive examples together with the association pattern. Such a filtering rule is 
called an association path in this paper. We can use this notion as a measure to 
decide the importance of a keyword in a text. We show the efficiency of association 
paths by experiments. 

This paper is organized as follows. In Section 2, we define the data model
for HTML pages, HTML trees, and path expressions. In Section 3, we review
the definition of the word-association pattern in [1] and formulate the mining
problem of this paper, called the Association Path problem. Next we describe a
mining algorithm which finds an association path for a given large collection
of HTML pages. In Section 4, we show several experimental results. In the first
experiment, the set of positive examples is a collection of HTML texts
containing the keyword "TSP" and the set of negative examples is a collection
of those containing "NP". These keywords stand for the travelling salesman
problem and NP-optimization problems in computational complexity theory,
respectively. The aim is to find an association path that characterizes the
notion of TSP in comparison with NP. In this experiment, the algorithm found
some interesting association paths. For the next experiment, we choose the
keyword "DNA" for positive examples. Compared to the first result, the
algorithm found few interesting paths. In Section 5, we conclude this study.

2 The Data Model 

In this section, we introduce the data model considered in this paper. First,
we begin with the notation used in this paper. $\mathbb{N}$ denotes the set of
all nonnegative integers. An alphabet $\Sigma$ is a finite set of symbols. A
finite sequence $(a_1, \ldots, a_n)$ of elements in $\Sigma$ is called a string
and is denoted by $w = a_1 \cdots a_n$ for short. The empty string of length
zero is $\varepsilon$. The set of all strings is denoted by $\Sigma^*$ and we
let $\Sigma^+ = \Sigma^* - \{\varepsilon\}$. For a string $w$, if
$w = \alpha\beta\gamma$, then the strings $\alpha$ and $\gamma$ are called a
prefix and a suffix of $w$, respectively. For a string $s$, we denote by $s[i]$
with $1 \le i \le |s|$ the $i$-th symbol of $s$, where $|s|$ is the length of
$s$.

For an HTML page, the HTML trees are the ordered node-labeled trees
defined as follows. For each tree $T$, the set of all nodes of $T$ is a finite
subset of $\mathbb{N}$, where 0 is the root. A node is called a leaf if it has
no child and is otherwise called an internal node. If nodes
$n, m \in \mathbb{N}$ have the same parent, then $n$ and $m$ are siblings. A
sequence $(n_1, \ldots, n_k)$ of nodes of $T$ is called a path if $n_1$ is the
root and $n_i$ is the parent of $n_{i+1}$ for all $i = 1, \ldots, k-1$. For a
path $p = (n_1, \ldots, n_k)$, the number $k$ is called the length of $p$ and
the node $n_k$ is called the tail of $p$.

With each node $n$, the pair $NL(n) = (N(n), V(n))$, called the node label of
$n$, is associated, where $N(n)$ and $V(n)$ are strings called the node name
and node value, respectively. If $N(n) \in \Sigma^+$ and
$V(n) = \varepsilon$, then the node $n$ is called an element node and the
string $N(n)$ is called the tag. If $N(n) = \#Text$ for the reserved string
#Text and $V(n) \in \Sigma^+$, then $n$ is called a text node and $V(n)$ is
called the text value. We assume that every node $n \in \mathbb{N}$ is either
an element node or a text node.

If a page $P$ contains a beginning tag of the form <tag> and $P$ contains no
ending tag corresponding to it, then the tag <tag> is called an empty tag in
$P$. If a page $P$ contains a string of the form $t_1 \cdot w \cdot t_2$ such
that $t_1, t_2$ are either beginning or ending tags and $w$ is a string not
containing any tag, then the string $w$ is called a text in $P$.

An HTML file is called a page. A page $P$ corresponds to an ordered labeled
tree. For simplicity, we assume that $P$ contains no comments, where a comment
is any string beginning with the string <!-- and ending with the string -->.

Definition 1. For a page $P$, we define the HTML tree $P_t$ recursively as
follows.

1. If $P$ contains an empty tag of the form <tag>, then $P_t$ has the element
node $n$ such that it is a leaf of $P_t$, $N(n) = tag$, and
$V(n) = \varepsilon$.

2. If $P$ contains a text $w$, then $P_t$ has the text node $n$ such that it is
a leaf of $P_t$, $N(n) = \#Text$, and $V(n) = w$.

3. If $P$ contains a string of the form <tag>s</tag> for a string
$s \in \Sigma^*$, then $P_t$ has the subtree $n(n_1, \ldots, n_k)$, where
$N(n) = tag$, $V(n) = \varepsilon$, and $n_1, \ldots, n_k$ are the roots of the
trees $t_1, \ldots, t_k$ which are obtained from $s$ by recursively applying
items 1, 2, and 3 above.

Next we define the functions that return the node names, node values, and HTML
attributes of given nodes in the HTML trees defined above. These functions are
useful for explaining the algorithms in the next section. They return the
values indicated below and return null if such values do not exist.

- $Parent(n)$: The parent of the node $n \in \mathbb{N}$.

- $ChildNodes(n)$: The sequence of all children of $n$.

- $Name(n)$: The node name $N(n)$ of $n$.

- $Value(n)$: The concatenation $V(n_1) \cdots V(n_k)$ for the leaves
$n_1, \ldots, n_k$ of the subtree rooted at $n$, in left-to-right order.

Recall that $V(n)$ is not empty only if $n$ is a text node. Thus, $Value(n)$ is
equal to the concatenation of the values of all text nodes below $n$. Let $P_t$
be an HTML tree for a page $P$ and let $N = \{0, \ldots, n\}$ be the set of
nodes in $P_t$. For nodes $i, j \in N$, if there is a sequence
$p_{ij} = (i_1, \ldots, i_k)$ of nodes in $N$ such that $i_1 = i$, $i_k = j$,
and $i_\ell = Parent(i_{\ell+1})$ for all $1 \le \ell \le k-1$, then $p_{ij}$
is called the path from $i$ to $j$. If $i$ is the root, then $p_{ij}$ is
denoted by $p_j$ for short. For each path $p = (i_1, \ldots, i_k)$ of $P_t$, we
also define the following useful notation.

- $Name(p)$: The sequence $(Name(i_1), \ldots, Name(i_k))$.

- $Value(p)$: $V(i_k)$.
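The pairs (Name(p), Value(p)) for root-to-text paths can be collected with a standard parser. The sketch below uses Python's `html.parser`; the class name `PathCollector`, the function `text_paths`, and the list `VOID_TAGS` of tags treated as empty are assumptions made for illustration, not part of the paper.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}  # assumed set of empty tags

class PathCollector(HTMLParser):
    """Collects (Name(p), Value(p)) for every root-to-text-node path p."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tag names of the currently open elements
        self.paths = []  # list of (tuple of names, text value)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:  # empty tags are leaves; nothing nests below them
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the matching open tag (tolerates sloppy HTML)
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append((tuple(self.stack) + ("#Text",), text))

def text_paths(page):
    collector = PathCollector()
    collector.feed(page)
    collector.close()
    return collector.paths

page = "<html><body><p>The <i>local search</i> heuristic<br></p></body></html>"
for names, value in text_paths(page):
    print(names, "->", value)
```

Only paths whose tail is a text node are recorded, since those are the paths with a nonempty Value(p).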



Definition 2. Let $P_t$ be an HTML tree over the set $N$ of nodes. Let
$p = (i_1, \ldots, i_n)$ be a path of $P_t$ and let
$Name_t = \{Name(n) \mid n \in N\}$. A sequence
$\alpha = (name_1, \ldots, name_m)$ with $name_i \in Name_t$ is called a path
expression over $Name_t$. We say that $\alpha$ matches $p$ if there exists a
subsequence $j_1, \ldots, j_m$ of $p$ such that $Name(j_\ell) = name_\ell$ for
all $1 \le \ell \le m$.

In the next section, we define a measure of the matching of the path expres- 
sions with the paths of HTML trees. We also define the finding problem of a 
path expression to maximize the measure. 



3 Mining HTML Texts 

In this section we first define the problem to find an expression, called an as- 
sociation pattern, for filtering semistructured texts. The pattern is a pair of a 
word-association pattern and a path expression. The semantics of the patterns 
is defined by the matching semantics of the word-association patterns and the 
path expressions. 
3.1 The Problem

A word-association pattern [1] $\pi$ over $\Sigma$ is a pair
$\pi = (p_1, \ldots, p_d; k)$ of a finite sequence of strings in $\Sigma^*$ and
a parameter $k$, called the proximity, which is either a nonnegative integer or
infinity. A word-association pattern $\pi$ matches a string $s \in \Sigma^*$ if
there exists a sequence $i_1, \ldots, i_d$ of integers such that every $p_j$ in
$\pi$ occurs in $s$ at the position $i_j$ and $0 \le i_{j+1} - i_j \le k$ for
all $1 \le j \le d-1$. The term $(d,k)$-pattern refers to a $d$-word
$k$-proximity word-association pattern $(p_1, \ldots, p_d; k)$.
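A small sketch may make the proximity condition concrete. It assumes that positions are counted in words (the experiments below speak of k = 10 words) and that matching is case-insensitive; both are assumptions, and the function names `occurrences` and `matches` are hypothetical.

```python
def occurrences(phrase, words):
    """Word positions at which the (possibly multi-word) phrase starts."""
    n = len(phrase)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase]

def matches(phrases, k, text):
    """True if p_1..p_d occur at positions i_1..i_d with 0 <= i_{j+1} - i_j <= k."""
    words = text.lower().split()
    reachable = None  # feasible positions of the previous phrase
    for phrase in phrases:
        here = occurrences(phrase.lower().split(), words)
        if reachable is not None:
            here = [i for i in here if any(0 <= i - j <= k for j in reachable)]
        if not here:
            return False
        reachable = here
    return True

text = "A local search heuristic for the TSP"
print(matches(["local search", "tsp"], 10, text))
print(matches(["local search", "tsp"], 3, text))
```

Because the constraint only links consecutive phrases, it suffices to carry forward the set of feasible positions of the previous phrase. An unbounded proximity can be passed as `float("inf")`.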

Let $S = \{s_1, \ldots, s_m\}$ be a finite set of strings in $\Sigma^*$ and let
$\psi$ be a labeling function $\psi : S \to \{0, 1\}$. Then, for a string
$s \in S$, we say that a word-association pattern $\pi$ agrees with $\psi$ on
$s$ if $\pi$ matches $s$ iff $\psi(s) = 1$.

Given $(\Sigma, S, \psi, d, k)$ consisting of an alphabet $\Sigma$, a finite
set $S \subseteq \Sigma^*$ of strings, a labeling function
$\psi : S \to \{0, 1\}$, and positive integers $d$ and $k$, the problem
Max Agreement by $(d,k)$-Pattern [1] is to find a $(d,k)$-pattern $\pi$ that
maximizes the agreement with $\psi$, i.e., the number of strings in $S$ on
which $\pi$ agrees with $\psi$.
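The agreement count and a naive generate-and-test search can be sketched as follows. The matcher `ordered_words` is a deliberately simplified stand-in that ignores the proximity k, and the candidate set (all patterns of one or two words from the sample) is an assumption chosen to keep the example small; none of these names come from [1].

```python
from itertools import product

def agreement(matches, pattern, labeled):
    """Number of strings s for which (pattern matches s) == (label == 1)."""
    return sum(1 for s, label in labeled if matches(pattern, s) == (label == 1))

def max_agreement(matches, candidates, labeled):
    """Naive generate-and-test: score every candidate and keep the best one."""
    return max(candidates, key=lambda pat: agreement(matches, pat, labeled))

def ordered_words(pattern, text):
    """Stand-in matcher (an assumption): the words of `pattern` appear in this order."""
    it = iter(text.lower().split())
    return all(any(w == x for x in it) for w in pattern)

labeled = [("local search for tsp", 1), ("euclidean tsp tours", 1),
           ("np complete problems", 0), ("search problems in np", 0)]
vocab = sorted({w for s, _ in labeled for w in s.split()})
candidates = list(product(vocab, repeat=1)) + list(product(vocab, repeat=2))
best = max_agreement(ordered_words, candidates, labeled)
print(best, agreement(ordered_words, best, labeled))
```

The number of candidates grows as the vocabulary size to the power d, which is why exhaustive enumeration is practical only for very small d.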

Definition 3. An association path is an expression of the form
$\alpha\#\pi$, where $\alpha$ is a path expression whose tail is #Text, $\pi$
is a word-association pattern, and # is a special symbol not belonging to any
$\alpha$ or $\pi$. Let $p = \alpha\#\pi$ be an association path and $p'$ be a
path in a tree. We say that $p$ matches $p'$ if $\alpha$ matches $p'$ and $\pi$
matches $Value(p')$.

For a finite set $T$ of HTML trees, let

$Text_T = \{(Name(p), Value(p)) \mid p \text{ is a path of } t \in T,\ Value(p) \ne \varepsilon\}$.

Intuitively, a path $p$ appearing in $Text_T$ is a path of an HTML tree such
that the tail of $p$ is a text node. Let $Name_T$ be the set of $Name(p)$ and
let $Value_T$ be the set of $Value(p)$ in $Text_T$.

Definition 4. Association Path

An instance is $(\Sigma, Text_T, \psi, d, k)$ consisting of an alphabet
$\Sigma$, a set $Text_T$ of pairs for a finite set $T$ of HTML trees, a
labeling function $\psi : Value_T \to \{0, 1\}$, and positive integers $d, k$.
A solution is an association path $\alpha\#\pi$. The string $\pi$ is a
$(d,k)$-pattern for a solution of the max agreement problem for the input
$(\Sigma, Value_T, \psi, d, k)$. The string $\alpha$ is a $(d,k)$-pattern for a
solution of the max agreement problem for the input
$(\Sigma, Name_T, \psi', d, k)$, where $\psi'$ is defined by
$\psi'(Name(p)) = 1$ iff $\psi(Value(p)) = 1$. The goal of the problem is to
maximize the sum of the agreements of $\psi$ and $\psi'$ over all association
paths $\alpha\#\pi$.

3.2 The Algorithm 

To find association paths, the HTML texts are transformed into path
expressions as follows. A given large set $S$ of HTML texts is divided into two
disjoint sets $S_1$ and $S_2$ by a labeling function. The labeling function is
given as a keyword or phrase by a user, i.e., any text in $S$ is labeled by 1
if it contains the keyword and labeled by 0 otherwise. Next all texts in $S_1$
and $S_2$ are parsed into HTML trees; let $Pos$ be the set of all paths from
$S_1$ and $Neg$ be the set of all paths from $S_2$. Fig. 1 shows the process of
our algorithm briefly.



Fig. 1. The process of the mining algorithm.



Algorithm Path-Find($\Sigma$, $P$, $\psi$, $d$, $k$)

/* Input: a set of HTML pages $P$ over $\Sigma$, a labeling function $\psi$,
nonnegative integers $d$, $k$ */

/* Output: a solution of Association Path for the input */

1. Let $P_1$ be the set of all pages in $P$ labeled by 1 and let
$P_2 = P - P_1$. For the sets $T_1$ and $T_2$ of HTML trees of $P_1$ and $P_2$,
compute the set $Pos$ of all paths of trees in $T_1$ and the set $Neg$ of all
paths of trees in $T_2$.

2. Let $Pos = \{p_i \mid 1 \le i \le m\}$ and $Neg = \{q_j \mid 1 \le j \le n\}$
($m, n \ge 0$). Compute the sets $Name_{Pos} = \{Name(p) \mid p \in Pos\}$,
$Value_{Pos} = \{Value(p) \mid p \in Pos\}$,
$Name_{Neg} = \{Name(q) \mid q \in Neg\}$, and
$Value_{Neg} = \{Value(q) \mid q \in Neg\}$.

3. Find a $(d,k)$-pattern $\pi$ of the max agreement problem for
$(Value_{Pos}, Value_{Neg})$, and find a $(d,k)$-pattern $\alpha$ of the max
agreement problem for $(Name_{Pos}, Name_{Neg})$.

4. Output the pattern $\alpha\#\pi$ which maximizes the sum of the agreements
of $\alpha$ and $\pi$.
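The four steps can be sketched in a few lines under strong simplifications: paths are already given as (tag names, text) pairs with labels, the proximity k is ignored, and brute-force enumeration stands in for the pattern-discovery algorithm of [1]. The function names `subseq`, `best_pattern`, and `path_find` are hypothetical.

```python
from itertools import product

def subseq(pattern, seq):
    """True if `pattern` is a subsequence of `seq`."""
    it = iter(seq)
    return all(any(a == b for b in it) for a in pattern)

def best_pattern(pos, neg, d):
    """Brute-force max agreement over patterns of at most d symbols (proximity ignored)."""
    vocab = sorted({x for seq in pos + neg for x in seq})
    cands = [c for n in range(1, d + 1) for c in product(vocab, repeat=n)]

    def score(c):
        return sum(subseq(c, s) for s in pos) + sum(not subseq(c, s) for s in neg)

    best = max(cands, key=score)
    return best, score(best)

def path_find(pos_paths, neg_paths, d=2):
    """pos_paths / neg_paths: lists of (tag-name tuple, text value) pairs."""
    name_pos = [names for names, _ in pos_paths]            # step 2
    name_neg = [names for names, _ in neg_paths]
    value_pos = [tuple(v.lower().split()) for _, v in pos_paths]
    value_neg = [tuple(v.lower().split()) for _, v in neg_paths]
    alpha, a_score = best_pattern(name_pos, name_neg, d)    # step 3
    pi, p_score = best_pattern(value_pos, value_neg, d)
    return alpha, pi, a_score + p_score                     # step 4

pos = [(("html", "body", "p", "i"), "local search"),
       (("html", "body", "p", "i"), "euclidean tsp")]
neg = [(("html", "body", "p"), "np complete"),
       (("html", "body", "a"), "search index")]
alpha, pi, total = path_find(pos, neg)
print(" ".join(alpha), "#", " ".join(pi), total)
```

In this toy sample the tag `i` separates the two classes perfectly, whereas no single word pattern does, which mirrors the motivation for combining tag and text patterns.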



We estimate the running time of Path-Find. This algorithm finds an
association path only for the paths whose tails are text nodes, i.e., the paths
of the form $p = (n_1, \ldots, n_k)$ in which each $n_i$ ($1 \le i \le k-1$)
is an element node and $n_k$ is a text node. Thus, for such paths $p$, we
regard the mining problem as the problem of finding two phrase patterns,
$\alpha$ from the strings $Name(p)$ and $\beta$ from the strings $Value(p)$,
for constant parameters $d$, the number of phrases, and $k$, the distance
between phrases.

If the maximum number of phrases in a pattern is bounded by a constant $d$,
then the max agreement problem for $(d,k)$-patterns is solvable by the
Enumerate-Scan algorithm [19], a modification of a naive generate-and-test
algorithm, in $O(n^d)$ scans, although it is still too slow to apply to
real-world problems.

Adopting the framework of optimized pattern discovery, we have developed
an efficient algorithm, called Split-Merge [1], that finds all the optimal
patterns for the class of $(d,k)$-patterns for various statistical measures,
including the classification error and information entropy.
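For concreteness, one common way to score the split that a pattern induces on a labeled sample is shown below. The exact definitions used in [1] may differ; the function names `entropy` and `split_scores` and the argument convention are assumptions.

```python
from math import log2

def entropy(p):
    """Binary entropy of a probability p, with 0 * log(0) taken as 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def split_scores(m1, n1, m0, n0):
    """m1 of n1 matched strings are positive; m0 of n0 unmatched strings are positive."""
    total = n1 + n0
    # classification error: label each side by its majority class
    error = (min(m1, n1 - m1) + min(m0, n0 - m0)) / total
    # information entropy: size-weighted entropy of the two sides
    ent = sum(n / total * entropy(m / n) for m, n in ((m1, n1), (m0, n0)) if n)
    return error, ent

print(split_scores(8, 10, 2, 10))
print(split_scores(10, 10, 0, 10))
```

Both measures are zero for a pattern that separates positives from negatives perfectly, and smaller values indicate better patterns.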

The algorithm quickly searches the hypothesis space using dynamic
reconstruction of the content index, called a suffix array, combining several
techniques from computational geometry and string algorithms.

We showed that the Split-Merge algorithm runs in almost linear time on
average, more precisely in $O(k^{d-1} N (\log N)^{d+1})$ time using
$O(k^{d-1} N)$ space for nearly random texts of size $N$ [1]. We also showed
that the problem of finding one of the best phrase patterns with arbitrarily
many strings is MAX SNP-hard [1]. Thus, there is no efficient approximation
algorithm with arbitrarily small error for the problem when the number $d$ of
phrases is unbounded.



4 Experimental Results 

In this section, we show the experimental results. The text data is a
collection from ResearchIndex (footnote 1), which is a scientific literature
digital library. The positive data is the set $Pos$ of HTML pages containing
the keyword "TSP" and the negative data is the set $Neg$ of HTML pages
containing the keyword "NP". The set $Neg$ covers many topics on problems in
computational complexity, and $Pos$ is concerned with one of the most popular
NP-hard problems, the Travelling Salesman Problem, and is not properly
contained in $Neg$. The aim of this experiment is to find an association path
which characterizes TSP against NP.

In this experiment on the collection of 8.4 MB, the algorithm Path-Find
finds the best 600 patterns under the entropy measure in seconds for $d = 2$
and in three minutes for $d = 3$, with $k = 10$ words, using 200 megabytes of
main memory on an IBM PC (Pentium III 600 MHz, g++ on Windows 98). The result
obtained by our algorithm is shown in Fig. 2.

Our system found several interesting association paths which may be difficult
for human users to find by inspection. Fig. 2 contains some association paths
whose tag sequences contain the <i> tag. This means that the phrases, e.g.,
'local search' and 'euclidean tsp', are emphasized by the tag. Thus we consider
these phrases to be interesting. In fact, these phrases are remarkable for the
following reasons.

The phrase 'local search' in Rank 171 indicates the local search heuristics
for TSP such as [14]. In this path, the tags <i> and <font> (font style and
size) on the left-hand side indicate the importance of the phrase
<local search> on the right-hand side. The phrase 'tsp and other' in Rank 276
is a substring of the title of the outstanding paper written by Arora [2] in
1996 on the approximation algorithm for Euclidean TSP. The Euclidean graph is
an important geometric structure for constructing an approximation algorithm
for TSP. These keywords appear in Ranks 394, 455, and 552, respectively.

Footnote 1: http://citeseer.nj.nec.com/

Rank  Association path $\alpha\#\pi$

   5  <i font p body html> # <tsp>
  38  <i font p body html> # <for the>
  90  <i font p body html> # <the tsp>
 171  <i font p body html> # <local search>
 213  <i font p body html> # <traveling>
 276  <i font p body html> # <tsp and other>
 394  <i font p body html> # <euclidean tsp>
 455  <i font p body html> # <other geometric problems>
 552  <i> # <approximation schemes for euclidean>

Fig. 2. The association paths found in the experiments, which distinguish the
Web pages on the TSP problem from those on NP-optimization problems. The
parameters are (2, 10) for $(d, k)$, where $\alpha$ is a path and $\pi$ is a
phrase.

Next we examine the same text data with the association pattern algorithm [1]
and compare the resulting phrases with our result. The list of 400 phrases
found by the association pattern algorithm is partially presented in Fig. 3. As
shown in this list, almost all phrases are trivial except 'local search'.



   0  <the Xtsp >            10  <tsp Xand >
   1  <<font Xtsp >          11  <local search Xthe >
   2  <for Xtsp >            12  <the Xsalesman >
   3  <and Xtsp >            13  <and Xnp >
   4  <tsp Xin >             14  <the Xnp >
   5  <tsp Xof >             15  <and Xlocal search >
   6  <tsp Xthe >            16  <tsp Xa >
   7  <of Xtsp >             17  <np Xthe >
   8  <= Xfor the >          18  <for Xtraveling >
   9  <<font Xfor the >      19  <local Xthe >

Fig. 3. The top 20 patterns of 400 association patterns found by the algorithm
in [1]. The parameters are (2, 10) for $(d, k)$.



Moreover, it is difficult to recognize the importance of 'local search' from
this result alone because it is an ordinary phrase in computer science. On the
other hand, we show all phrases containing the emphasis tag <i> in Fig. 4.
Comparing with Fig. 2, we can confirm that none of the important phrases
'local search', 'euclidean tsp', 'geometric', and 'approximation' appears in
the list of Fig. 4. Thus we conclude that our algorithm is effective in this
examination.



 146  <<i>the Xtsp >
 212  <<i>the Xsolutions for tsp >
 213  <<i>the Xfor tsp >
 214  <<i>the traveling Xtsp >
 215  <<i>the traveling Xsolutions for tsp >
 216  <<i>the traveling Xfor tsp >
 240  <<i>the X computational solutions for tsp >
 241  <<i>the traveling X computational solutions for tsp >
 256  <<font ><<i>the traveling >

Fig. 4. All association patterns containing the <i> tag found by the algorithm
in [1]. The parameters are (2, 10) for $(d, k)$.



Finally, we show other experimental results. The positive sample is a set of
HTML pages containing the keyword "DNA" and the negative sample is the
same as in the above experiment. In this experiment on the collection of
9.3 MB, the algorithm finds the best 600 patterns under the entropy measure in
seconds for $d = 2$ with $k = 10$. The result is shown in Fig. 5. Our system
found few association paths containing interesting keywords like "sequence",
"computer", and "molecular". In this result, several paths contain the anchor
tag. Unfortunately, interesting keywords are not found in such paths.



Rank  Association path $\alpha\#\pi$

  23  <a> # <dna>
 136  <i font p body html> # <dna sequences>
 199  <i font p body html> # <molecular>
 360  <i font p body html> # <computer>
 395  <a> # <computation>
 444  <a body html> # <computing>

Fig. 5. Other results of the experiments, for DNA against NP-optimization
problems. The parameters are also (2, 10) for $(d, k)$, where $\alpha$ is a
path and $\pi$ is a phrase.
5 Conclusion

We introduced a new method for mining HTML texts and presented an algorithm
to find an association path, which is a pair of association patterns over
tag sequences and text sequences. In experiments on HTML data of scientific
literature, the algorithm found interesting association paths from positive and
negative examples on the traveling salesman problem and other NP-optimization
problems.



Acknowledgments 

The authors are grateful to the anonymous referees for their careful reading
of the draft and useful comments. Shinichi Shimozono thanks Miho Matsui
for the suggestive discussions and observations obtained while supervising her
graduation thesis.

References 

1. Shimozono, S., Arimura, H., and Arikawa, S. Efficient discovery of optimal
word-association patterns in large text databases. New Generation Computing
18:49-60, 2000.

2. Arora, S. Polynomial-time approximation schemes for Euclidean TSP and other 
geometric problems. Proc. 37th IEEE Symposium on Foundations of Computer 
Science, 2-12, 1996. 

3. Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web: From relations to 
semistructured data and XML, Morgan Kaufmann, San Francisco, CA, 2000. 

4. Angluin, D. Queries and concept learning. Machine Learning 2:319-342, 1988. 

5. Buneman, P., Davidson, S., Hillebrand, G., and Suciu, D. A query language and 
optimization techniques for unstructured data. University of Pennsylvania, Com- 
puter and Information Science Department, Technical Report MS-CIS 96-09, 1996. 

6. Cohen, W. W. and Fan, W. Learning Page-Independent Heuristics for Extracting 
Data from Web Pages, Proc. WWW-99. 1999. 

7. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., 
and Slattery, S. Learning to construct knowledge bases from the World Wide 
Web, Artificial Intelligence 118:69-113, 2000. 

8. Freitag, D. Information extraction from HTML: Application of a general
machine learning approach. Proc. the 15th National Conference on Artificial
Intelligence, 517-523, 1998.

9. Grieser, G., Jantke, K. P., Lange, S., and Thomas, B. A unifying approach to
HTML wrapper representation and learning, Proc. the 3rd International
Conference, DS2000, Lecture Notes in Artificial Intelligence 1967:50-64, 2000.

10. Hammer, J., Garcia-Molina, H., Cho, J., and Crespo, A. Extracting semistructured 
information from the Web. Proc. Workshop on Management of Semistructured 
Data, 18-25, 1997. 

11. Hsu, C.-N. Initial results on wrapping semistructured web pages with finite-state 
transducers and contextual rules. Proc. 1998 Workshop on AI and Information 
Integration, 66-73, 1998. 

12. Kamada, T. Compact HTML for small information appliances. W3C NOTE
09-Feb-1998. www.w3.org/TR/1998/NOTE-compactHTML-19980209, 1998.




388 K. Taniguchi et al. 



13. Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intel- 
ligence 118:15-68, 2000. 

14. Lin, S., and Kernighan, B. W. An effective heuristic algorithm for the travelling 
salesman problem. Operations Research 21:498-516, 1973. 

15. Muslea, I., Minton, S., and Knoblock, C. A. Wrapper induction for
semistructured, web-based information sources. Proc. Conference on Automated
Learning and Discovery, 1998.

16. Sakamoto, H., Arimura, H., and Arikawa, S. Identification of tree translation rules 
from examples. Proc. the 5th International Colloquium on Grammatical Inference, 
LNAI 1891:241-255, 2000. 

17. Thomas, B. Anti-unification based learning of T-Wrappers for information extrac- 
tion, Proc. AAAI Workshop on Machine Learning for IE, 15-20, AAAI, 1999. 

18. Valiant, L. G. A theory of the learnable. Comm. ACM 27:1134-1142, 1984. 

19. Wang, J. T., Chirn, G. W., Marr, T. G., Shapiro, B., Shasha, D., and Zhang, K. 
Combinatorial pattern discovery for scientific data: Some preliminary results. Proc. 
SIGMOD’94, 115-125, 1994. 




Theory Revision in Equation Discovery 



Ljupco Todorovski and Saso Dzeroski 



Department of Intelligent Systems, Jozef Stefan Institute 
Jamova 39, 0.50 Ljubljana, Slovenia 
Ljupco.Todorovski@ijs.si, Saso.Dzeroski@ijs.si



Abstract. State of the art equation discovery systems start the discov- 
ery process from scratch, rather than from an initial hypothesis in the 
space of equations. On the other hand, theory revision systems start from 
a given theory as an initial hypothesis and use new examples to improve 
its quality. Two quality criteria are usually used in theory revision sys- 
tems. The first is the accuracy of the theory on new examples and the 
second is the minimality of change of the original theory. In this paper, 
we formulate the problem of theory revision in the context of equation 
discovery. Moreover, we propose a theory revision method suitable for 
use with the equation discovery system Lagramge. The accuracy of the 
revised theory and the minimality of theory change are considered. The 
use of the method is illustrated on the problem of improving an exist- 
ing equation based model of the net production of carbon in the Earth 
ecosystem. Experiments show that small changes in the model parame- 
ters and structure considerably improve the accuracy of the model. 



1 Introduction 

Most of the existing equation discovery systems make use of a very limited 
portion of the theoretical knowledge available in the domain of interest. Usually, 
the domain knowledge is used to constrain the search space of possible equations 
to the equations that make sense from the point of view of the domain experts. 
One aspect of the domain knowledge that is usually neglected by equation
discovery systems is the set of existing models in the domain. Rather than
starting the search with an existing equation based model, equation discovery 
systems always start their search from scratch. In contrast with them, theory 
revision systems [9,3] start with an existing model and use heuristic search to 
revise the model in order to improve its fit to observational data. 

Most of the work on theory revision systems is on the revision of theories 
in propositional and first-order logic [9]. In this paper, we propose a flexible 
grammar based framework for theory revision in equation discovery. The ex- 
isting initial model is transformed to a grammar, and alternative productions 
are used to define a space of possible revised equation models. The grammar 
based equation discovery system Lagramge [6] is then used to search through 
the space of revised models and find the one that fits observational data best. 
The use of the proposed framework is illustrated on revising an equation based 
earth-science model of the net production of carbon in the Earth ecosystem. 
The paper is organized as follows. The following section gives a brief
introduction to grammar based equation discovery. Typical approaches to
revision of theories in propositional and first-order logic are briefly
reviewed in Section 3. The grammar based framework for theory revision in
equation discovery is presented in Section 4. Section 5 presents the
experiments with revising the earth-science equation model. The last section
summarizes the paper, discusses related work and gives directions for further
work.

2 Equation Discovery 

Equation discovery is the area of machine learning that develops methods for
automated discovery of quantitative laws, expressed in the form of equations,
in collections of measured data [1]. Equation discovery systems heuristically
search through a subset of the space of all possible equations and try to find
the equation which fits the measured data best.

Different equation discovery systems explore different spaces of possible
equations. Early equation discovery systems used predefined (built-in) spaces
that were small enough to allow effective heuristic (or exhaustive) search.
However, this approach does not allow the user of the equation discovery system
to tailor the space of possible equations to the domain of interest. On the
other hand, recent equation discovery systems use different approaches to allow
the user to restrict the space of possible equations. In equation discovery
systems that are based on genetic programming, the user is allowed to specify a
set of algebraic operators that can be used. A similar approach has been used
in the EF [10] equation discovery system. The equation discovery system SDS [7]
effectively uses user-provided scale-type information about the dimensions of
the system variables and is capable of discovering complex equations from noisy
data.

Finally, the equation discovery system Lagramge [6] allows the user to 
specify the space of possible equations using a context free grammar. Note that 
grammars are a more general and powerful mechanism for tailoring the space 
of the equations to the domain of use than the ones used in SDS [7] and EF 
[10]. In the rest of this section we will describe this grammar based approach to 
equation discovery used in Lagramge. 

2.1 Grammar-Based Equation Discovery 

The problem of grammar based equation discovery can be formalized as follows. 

Given:

- a set of variables $V = \{v_1, v_2, \ldots, v_n\}$ of the observed system,
including a target dependent variable $v_d \in V$;

- a grammar $G$; and

- a table $M$ of observations (measured values) of the system variables.

Find a model $E$ in the form of one or more algebraic or differential equations
defining the target variable $v_d$ that:

1. is derived by the grammar $G$; and

2. minimizes the discrepancy between the observed values of the target variable
$v_d$ and the values of $v_d$ obtained by simulating the model.

An example of a grammar for equation discovery is given in Table 1. The
grammar contains a set of two nonterminal symbols {P_Vdiff, Vdiff}, with a
set of productions attached to each of them, and a set of three terminal
symbols {v1, v2, const[0:1]}. The semantics of the terminal and nonterminal
symbols in the grammar are explained below.

There are two types of terminal symbols used in the grammars for equation
discovery. The first group is used to denote the variables of the observed
system (v1 and v2 in the example grammar from Table 1). Another group of
terminal symbols, of the form const[l:h], is used to denote a constant
parameter in the equation model whose value has to be fitted against the
observational data from $M$. A constraint [l:h] specifies that the value $v$ of
the constant parameter should be within the interval $l \le v \le h$.



Table 1. An example of a grammar for equation discovery defining the space of
polynomials of a single variable $v_{\mathit{diff}} = v_1 - v_2$.

P_Vdiff -> const[0:1]
P_Vdiff -> const[0:1] + (P_Vdiff) * (Vdiff)
Vdiff -> v1 - v2



The nonterminal symbol Vdiff defines an intermediate variable which is the
difference between two system variables v1 and v2. This is done with the single
production for the nonterminal symbol Vdiff. The other nonterminal symbol
P_Vdiff is used to build polynomials of an arbitrary degree.
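To see which equation structures the grammar of Table 1 generates, the following sketch enumerates all terminal strings derivable within a bounded number of nested expansions. The dictionary encoding `GRAMMAR` and the function `derive` are illustrative assumptions and do not reflect how Lagramge represents grammars internally.

```python
GRAMMAR = {
    "P_Vdiff": [["const[0:1]"],
                ["const[0:1]", "+", "(", "P_Vdiff", ")", "*", "(", "Vdiff", ")"]],
    "Vdiff": [["v1", "-", "v2"]],
}

def derive(symbol, depth):
    """All terminal strings derivable from `symbol` with at most `depth` nested expansions."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [symbol]
    if depth == 0:             # expansion budget exhausted
        return []
    results = []
    for production in GRAMMAR[symbol]:
        partial = [""]
        for sym in production:
            partial = [p + (" " if p else "") + s
                       for p in partial for s in derive(sym, depth - 1)]
        results.extend(partial)
    return results

for expr in derive("P_Vdiff", 3):
    print(expr)
```

Each additional level of depth adds one more nested factor (v1 - v2), i.e., raises the polynomial degree by one, so the depth bound plays the role of a limit on model complexity.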

2.2 Lagramge 

The equation discovery system Lagramge applies heuristic (or exhaustive)
search through the space of models generated using a user-provided grammar
$G$. The values of the constant parameters (terminal symbols const) in the
generated models are fitted against the input data $M$ using a standard
non-linear constrained optimization method. After fitting the values of the
constant parameters, the model is evaluated according to the sum of squared
errors (SSE heuristic function [6]), i.e., the differences between the observed
values of the target variable $v_d$ and the values of $v_d$ calculated by the
model. Alternatively, an MDL heuristic function that takes into account the
complexity of the model can also be used [6].
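The fit-then-score step can be illustrated for the degree-one structure const + (const) * (v1 - v2). The sketch below is not Lagramge's optimizer: projected gradient descent is swapped in as a simple stand-in for a constrained optimization method, and the data, step size, and function names are assumptions.

```python
def sse(c0, c1, data):
    """Sum of squared errors of the model c0 + c1 * (v1 - v2) on (v1, v2, vd) rows."""
    return sum((vd - (c0 + c1 * (v1 - v2))) ** 2 for v1, v2, vd in data)

def fit(data, low=0.0, high=1.0, steps=2000, rate=0.05):
    """Fit both constants within [low, high] by projected gradient descent."""
    c0 = c1 = 0.5 * (low + high)
    n = len(data)
    for _ in range(steps):
        g0 = g1 = 0.0
        for v1, v2, vd in data:
            r = (c0 + c1 * (v1 - v2)) - vd  # residual of the current model
            g0 += 2 * r / n
            g1 += 2 * r * (v1 - v2) / n
        # gradient step followed by projection onto the box [low, high]
        c0 = min(high, max(low, c0 - rate * g0))
        c1 = min(high, max(low, c1 - rate * g1))
    return c0, c1

# synthetic observations generated from vd = 0.3 + 0.2 * (v1 - v2)
data = [(x, 1.0, 0.3 + 0.2 * (x - 1.0)) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
c0, c1 = fit(data)
print(round(c0, 4), round(c1, 4), sse(c0, c1, data))
```

The projection keeps every constant inside its [l:h] interval at each step, which is the role the constraint plays in the grammar's const terminals.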

3 Theory Revision 

The problem of theory revision can be defined as follows: Given an imperfect
domain theory in the form of classification rules and a set of classified
examples, find an approximately minimal syntactic revision of the domain theory
that correctly classifies all of the examples.

A representative system that addresses this problem is Either [3]. Either 
refines propositional Horn-clause theories using a suite of abductive, deductive 
and inductive techniques. Deduction is used to identify the problems with the 
domain theory, while abduction and induction are used to correct them. The 
problem of theory revision has received a lot of attention in the field of inductive 
logic programming [2], where a number of approaches have been developed for 
revising theories in the form of first-order Horn clause theories. For an overview, 
we refer the reader to [9]. 

Two kinds of problems are encountered within imperfect domain theories: 
over-generality occurs when an example is classified into a class other than the 
correct one, while over-specificity occurs when an example cannot be proven to 
belong to the correct class. Note that a single example can be misclassified both 
ways at the same time. Overly general rules are either specialized by adding new 
conditions to their antecedents or are deleted from the knowledge base. Problems 
of over-specificity are solved by generalizing the antecedents of existing rules, e.g., 
by removing conditions from them, or by the induction of new rules. 

4 Grammar-Based Theory Revision of Equation Models 

4.1 Problem Definition 

The problem of grammar based theory revision can be formalized as follows. 

Given: 

- a set of variables $V = \{v_1, v_2, \ldots, v_n\}$ of the observed system,
  including a target dependent variable $v_d \in V$;

- an existing model E, represented as an equation(s) defining the target
  variable $v_d$. Note that this can actually be a set of (algebraic or
  differential) equations defining the value of the target variable $v_d$;

- a grammar G that derives the model E; and

- a table M of observations (measured values) of the system variables.

Find a revised model E' (equation/set of equations as above) that: 

1. is derived by the grammar G; 

2. minimizes the discrepancy between the observed values of the target variable
$v_d$ and the values of $v_d$ obtained by simulating the model; and

3. differs from the existing model E as little as possible. 

Items 2. and 3. above would typically appear in a formulation of a gen- 
eral theory revision problem, regardless of the language in which the theories 
are expressed. In contrast to our formulation, however, the possible changes to 
the initial theory would be specified in terms of revision operators that can be 
applied to the initial and intermediate theories. As theories are typically logical 
theories in theory revision settings, operators typically include addition/deletion 
of entire rules (propositional or first-order Horn clauses) and addition/deletion 
of conditions in individual rules.

4.2 From an Initial Model to a Grammar

In a typical setting of revising an existing scientific model, we would only have 
observational data and a model, i.e., an equation developed by scientists to 
explain a particular phenomenon. A grammar that would explain how this model 
was actually derived and provide options for alternative models is typically not 
available. The above is especially true for simpler models. 

However, when the model (equation) is complex, it is only rarely written as 
a single equation defining the target variable, but rather as a set of equations 
defining the target variable, which typically contains equations defining interme- 
diate variables. The latter typically define meaningful concepts in the domain of 
discourse. Often, alternative equations defining an intermediate variable would 
be possible and the modeling scientist would choose one of these: the alternatives 
would rarely (if ever) be documented in the model itself, but might be mentioned 
in a scientific article describing the derived model and the modeling process. 



Table 2. Equations defining the NPPc variable in the CASA earth-science model.

  NPPc     = max(0, E · IPAR)
  E        = 0.389 · T1 · T2 · W
  T1       = 0.8 + 0.02 · topt - 0.0005 · topt^2
  T2       = 1.1814 / ((1 + exp(0.2 · (TDIFF - 10))) · (1 + exp(0.3 · (-TDIFF - 10))))
  TDIFF    = topt - tempc
  W        = 0.5 + 0.5 · eet / PET
  PET      = 1.6 · (10 · max(tempc, 0) / ahi)^A · pet_tw_m
  A        = 0.000000675 · ahi^3 - 0.0000771 · ahi^2 + 0.01792 · ahi + 0.49239
  IPAR     = FPAR_FAS · monthly_solar · SOL_CONV · 0.5
  FPAR_FAS = min((SR_FAS - 1.08) / srdiff, 0.95)
  SR_FAS   = (1 + fas_ndvi / 1000) / (1 - fas_ndvi / 1000)
  SOL_CONV = 0.0864 · days_per_month



A set of equations defining a target variable through some intermediate vari- 
ables can easily be turned into a grammar, as demonstrated in Tables 2 and 3, 
which give an earth-science model and a grammar that derives this model only. 
Having the grammar in Table 3, however, enables us to specify alternative mod- 
els through providing additional productions for the nonterminal symbols in 
the grammar. Additional productions for intermediate variables would specify 
alternative choices, only one of which will eventually be chosen for the final 
model. Observational data would then be used to select among combinations of
such choices, by applying a grammar-based equation discovery system (such as
Lagramge) to the observational data, using the grammar that includes the
additional productions.
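
The conversion from a set of equations to a grammar that derives only those equations can be sketched in a few lines of Python. This is an illustrative assumption about how such a conversion might be automated, not the authors' implementation: the function name `equation_to_production` is hypothetical, and the sketch assumes powers are written as products (as in the grammar of Table 3), so that every numeric literal is a constant parameter.

```python
import re

# A numeric literal that is not part of an identifier such as T1 or T2.
NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?![\w.])")


def equation_to_production(equation):
    """Turn 'X = expr' into 'X -> expr' with every constant fixed as const[v:v]."""
    left, right = (side.strip() for side in equation.split("=", 1))
    fixed = NUMBER.sub(lambda m: "const[{0}:{0}]".format(m.group()), right)
    return "{} -> {}".format(left, fixed)


equations = [
    "NPPc = max(0, E * IPAR)",
    "E = 0.389 * T1 * T2 * W",
    "TDIFF = topt - tempc",
]
grammar = [equation_to_production(e) for e in equations]
```

Each intermediate variable becomes a nonterminal symbol, each observed variable stays a terminal symbol, and each constant is pinned to its original value.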

Table 3. Grammar derived from the equations for NPPc variable in the CASA
earth-science model in Table 2. The grammar generates the original equations only.

  NPPc     -> max(const[0:0], E * IPAR)
  E        -> const[0.389:0.389] * T1 * T2 * W
  T1       -> const[0.8:0.8] + const[0.02:0.02] * topt
              - const[0.0005:0.0005] * topt * topt
  T2       -> const[1.1814:1.1814] / ((const[1:1] + exp(const[0.2:0.2]
              * (TDIFF - const[10:10]))) * (const[1:1]
              + exp(const[0.3:0.3] * (-TDIFF - const[10:10]))))
  TDIFF    -> topt - tempc
  W        -> const[0.5:0.5] + const[0.5:0.5] * eet / max(PET, const[0:0])
  PET      -> const[1.6:1.6]
              * pow(const[10:10] * max(tempc, const[0:0]) / ahi, A)
              * pet_tw_m
  A        -> const[0.000000675:0.000000675] * ahi * ahi * ahi
              - const[0.0000771:0.0000771] * ahi * ahi
              + const[0.01792:0.01792] * ahi + const[0.49239:0.49239]
  IPAR     -> FPAR_FAS * solar * SOL_CONV * const[0.5:0.5]
  FPAR_FAS -> min((SR_FAS - const[1.08:1.08]) / srdiff, const[0.95:0.95])
  SR_FAS   -> (const[1:1] + fas_ndvi / const[1000:1000])
              / (const[1:1] - fas_ndvi / const[1000:1000])
  SOL_CONV -> const[0.0864:0.0864] * days_per_month

While the presented approach from the previous paragraph does take into
account the initial model, it may allow for a completely different model to be
derived, depending on whether productions for alternative definitions are
provided for each of the intermediate variables. It is here that the minimal
revision/change principle comes into play: among theories of similar quality
(fit to the data), theories that are closer to the original theory are to be
preferred. Since we are dealing with theories that are not necessarily expressed
in logic (e.g., equations), only syntactic criteria of minimality of change are
applicable in a straightforward fashion.

4.3 Typical Alternative Productions 

Note that when an alternative production is specified for an intermediate vari- 
able, there are no restrictions (at least in principle) on these productions. For 
example, they can introduce new intermediate variables and productions defin- 
ing them. They can also specify arbitrary functional forms (in the case of equa- 
tions). However, they do have to eventually derive (in the context of the entire 
grammar) valid sub-expressions involving the set of terminal symbols (system 
variables) associated to the initial model. 

A very common alternative production would replace the particular constants
on the right-hand side with generic constants, allowing the equation discovery
system to re-fit them to the given observational data. In the grammar from
Table 3 that change can be achieved by replacing a terminal symbol of the
form const[v:v], denoting a constant parameter with fixed value v, with a
generic symbol const that allows for an arbitrary value of the particular constant
parameter. In our experiments with the earth-science CASA model we allow for
a 100% change of the original values of the constant parameters in the initial
model. This can be specified by replacing the terminal symbol const[v:v] with
const[0:2*v], where the interval [0 : 2 · v] is equal to
[(v - 100% · v) : (v + 100% · v)] (a 100% relative change).
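
The replacement of const[v:v] by an interval around v is mechanical, as the following Python sketch suggests. The function name `relax_constants` and the string-based representation of productions are assumptions made for illustration; Lagramge's actual grammar handling may differ.

```python
import re

# Matches terminals of the form const[low:high].
FIXED = re.compile(r"const\[([^:\]]+):([^\]]+)\]")


def relax_constants(production, relative_change):
    """Widen every fixed const[v:v] to const[v - r*|v| : v + r*|v|]."""
    def widen(match):
        low, high = float(match.group(1)), float(match.group(2))
        if low != high:                      # already an interval: keep it
            return match.group(0)
        delta = abs(low) * relative_change
        return "const[{:.10g}:{:.10g}]".format(low - delta, low + delta)
    return FIXED.sub(widen, production)


relaxed_e = relax_constants("E -> const[0.389:0.389] * T1 * T2 * W", 1.0)
relaxed_ratio = relax_constants("fas_ndvi / const[1000:1000]", 0.2)
```

With a relative change of 1.0 the constant 0.389 may vary over [0 : 0.778]; with 0.2 the constant 1000 may vary over [800 : 1200].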

A slightly more complex alternative production would replace a particular
polynomial on the right-hand side of a production with an arbitrary polynomial
of the same (intermediate) variables. For example, the production for T1 in the
grammar from Table 3 can be replaced by a grammar, similar to the example
grammar from Table 1, for generating an arbitrary polynomial of the variable
topt.



4.4 Current Implementation 

Our current implementation of the theory revision approach to equation discov- 
ery outlined above involves applying Lagramge to the given observational data 
and a grammar specifying the possible alternative productions to be used in 
theory revision. The observational data are used to select a particular combina- 
tion of the possible alternatives: note that these also include leaving parts of the 
model unchanged (as the original productions are also a part of the grammar) 
even if alternative productions for these exist. 

We currently do not have an implementation of the minimal change preference
integrated within Lagramge. This, however, can be achieved in a relatively
straightforward manner. One of the heuristic functions used by Lagramge to
search the space of equations, called MDL, takes into account the degree of fit
(sum of squared errors) as well as the size of the equation model. A reasonable
approach to implementing a minimality of change principle would be to replace
the second term in the MDL heuristic: replace the size of the equation with a
distance between the current model and the initial model. The distance measure
can be a distance on tree-structured terms, which would take into account
the number and complexity of the alternative productions taken to derive the
current equation.
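
One way such a distance-based heuristic could look is sketched below in Python. Everything here is an assumption made for illustration: terms are represented as nested tuples, the `distance` function is a deliberately crude syntactic measure, and the additive form with a hypothetical `weight` is not the MDL formula used by Lagramge.

```python
def size(term):
    """Number of nodes in a term given as a nested tuple (operator, arg, ...)."""
    if not isinstance(term, tuple):
        return 1
    return 1 + sum(size(arg) for arg in term[1:])


def distance(a, b):
    """Crude syntactic distance: size of the subterms where a and b disagree."""
    if a == b:
        return 0
    same_shape = (isinstance(a, tuple) and isinstance(b, tuple)
                  and a[0] == b[0] and len(a) == len(b))
    if same_shape:
        return sum(distance(x, y) for x, y in zip(a[1:], b[1:]))
    return max(size(a), size(b))


def revision_score(sse, candidate, initial, weight=1.0):
    """Degree of fit plus a penalty for departing from the initial model."""
    return sse + weight * distance(candidate, initial)


initial = ("*", "0.389", ("*", "T1", ("*", "T2", "W")))
refit = ("*", "const", ("*", "T1", ("*", "T2", "W")))
restructured = ("*", "const", ("*", "const", ("*", "T2", "W")))
```

Under this measure, refitting one constant costs less than additionally replacing T1, so of two candidates with equal fit the one closer to the initial model receives the better (lower) score.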



5 Experiments in Revising an Earth-Science Model 

We illustrate the use of the proposed framework for theory revision in equation
discovery on the problem of revising one part of the earth-science CASA model
[4]. The CASA model predicts annual global fluxes in trace gas production on the
basis of a number of measured (observed) variables, such as surface temperatures,
satellite observations of the land surface, soil properties, etc. Because the whole
CASA model is a quite complex system of difference and algebraic equations, we
focused on the revision of the NPPc part of CASA (CASA-NPPc), presented in
Table 2, which is used to predict the monthly net production of carbon at a given
location.

The values of the input variables (terminal symbols in the grammar from
Table 3) were measured (and/or calculated) for 303 locations on the Earth,
providing a data set with 303 examples. In order to evaluate the accuracy of the
model on unseen data we applied the standard ten-fold leave-one-out cross
validation method. The error of the original and revised models was calculated
as the root mean squared error, defined as

  $\mathrm{RMSE} = \sqrt{\sum_{i=1}^{N} (NPPc_i - \widehat{NPPc}_i)^2 / N}$,

where N is the number of data points, and $NPPc_i$ and $\widehat{NPPc}_i$ are
the observed value and the value calculated by the model, respectively.
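
The error measure and the splitting of the 303 examples into test folds can be sketched as follows. The function names are assumptions made for this example, and the number of folds is left as a parameter because the text does not spell out the exact validation protocol.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error over paired observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def fold_indices(n_examples, n_folds):
    """Split example indices 0..n-1 into n_folds nearly equal test folds."""
    return [list(range(f, n_examples, n_folds)) for f in range(n_folds)]


folds = fold_indices(303, 10)
example_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

For each fold, the model would be revised on the remaining examples and its RMSE measured on the held-out fold.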

5.1 Revisions Used in the Experiments 

As described in Section 4 we first transformed the given NPPc model into a 
grammar (given in Table 3) that derives that model only. Furthermore, we added 
alternative productions to the grammar that define the space of possible revi- 
sions. We used six alternative possibilities for the revision of the NPPc model, 
described below. 

E-c-100 : we allowed a 100% relative change of the constant parameter 0.389
in the equation defining the intermediate variable E. Therefore, we replaced
the original production for nonterminal symbol E in the grammar with E ->
const[0:0.778] * T1 * T2 * W, i.e., changed the constraint on the value
of the constant parameter from the original const[0.389:0.389], which
fixes the value of the constant parameter, to const[0:0.778], which allows
a 100% relative change of the original value of the constant parameter
([0 : 0.778] being equal to [(0.389 - 100% · 0.389) : (0.389 + 100% · 0.389)]).

T1-c-100, T2-c-100 : we allowed the same revisions as the one described above
on the right-hand sides of the productions for T1 and T2.

SR_FAS-c-20 : we allowed a 20% relative change of the values of the constant
parameters in the equation defining the intermediate variable SR_FAS. The
relative change of 20% was used to avoid values of the constant parameters
lower than 800, which would cause singularity (division by zero) problems in
the formula for calculating SR_FAS.

T1-s : we allowed the original second-degree polynomial for calculating
T1 = 0.8 + 0.02 · topt - 0.0005 · topt^2 to be replaced with an arbitrary
polynomial of the same variable topt. The following alternative productions
were added to the grammar from Table 3 for this purpose: T1 -> const and
T1 -> const + (T1) * topt.

T2-s : the graph of the dependency between the T2 and TDIFF variables shows
a Gaussian-like, slightly asymmetrical dependency curve. Since this kind of
dependency can also be approximated with a higher-degree polynomial, we
replaced the original T2 production in the grammar from Table 3 with two
productions (similar to the ones for T1-s, presented above) that define an
arbitrary polynomial of the TDIFF variable.
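
The two productions T1 -> const and T1 -> const + (T1) * topt generate polynomials of any degree in Horner form. The Python sketch below, with hypothetical helper names, expands the recursive production a given number of times and evaluates the resulting form for concrete coefficients.

```python
def derive_polynomial(degree, variable="topt"):
    """Apply 'T1 -> const + (T1) * topt' degree times, then 'T1 -> const'."""
    if degree == 0:
        return "const"
    inner = derive_polynomial(degree - 1, variable)
    return "const + ({}) * {}".format(inner, variable)


def evaluate(coefficients, x):
    """Evaluate the derived Horner form; coefficients run from lowest degree up."""
    result = 0.0
    for c in reversed(coefficients):
        result = c + result * x
    return result


second_degree = derive_polynomial(2)
original_t1 = evaluate([0.8, 0.02, -0.0005], 10.0)
```

With the coefficients 0.8, 0.02 and -0.0005 the derived second-degree form coincides with the original T1 equation, so the original model remains inside the space of revisions.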

In addition to these six possibilities for revising the CASA-NPPc model we
also used different combinations of them.

5.2 Results of the Experiments

The results of the experiments with different alternative grammars for revision 
are presented in Table 4. 



Table 4. Error reduction (in %) gained with revising the original CASA-NPPc model
using different grammars for revision.

  Grammar                                        Reduction of RMSE (in %)

  SR_FAS-c-20                                    14.93
  T2-c-100                                       13.25
  T1-s                                           13.05
  T2-s                                           12.90
  E-c-100                                        12.59
  T1-c-100                                       12.39

  SR_FAS-c-20 + T2-s                             15.56
  SR_FAS-c-20 + T1-s                             15.46
  T2-c-100 + T1-s                                13.92
  T2-s + T1-s                                    13.30
  SR_FAS-c-20 + T2-c-100                         11.55

  SR_FAS-c-20 + T2-s + T1-s + E-c-100            16.19
  SR_FAS-c-20 + T2-s + T1-c-100 + E-c-100        15.44
  SR_FAS-c-20 + T2-c-100 + T1-s + E-c-100        14.82
  SR_FAS-c-20 + T2-c-100 + T1-c-100 + E-c-100    12.92



The first six rows of Table 4 show that revising the values of the constant
parameters in the equation for calculating SR_FAS gives the greatest
improvement of the original model. The original value of the parameters (equal
to 1000) defines an almost linear dependence of SR_FAS on the observed variable
srdiff. The revised values of the constant parameters were equal to 800 (the
lowest possible values), which increases the non-linearity of the dependence.
Allowing lower values of the parameters in the equation gives further
improvement, but singularity (division by zero) problems appear due to the
range of the srdiff variable.

The analysis of the results of the structural revisions shows the following.
The T1-s revision causes the second-degree polynomial for calculating the T1
variable to be replaced by a fourth-degree polynomial. On the other hand, the
structural revision T2-s replaced the complex formula for calculating T2 with a
constant value. This is a surprising result that would have to be discussed
with the Earth science experts who built the CASA model.

Furthermore, we tested pairwise combinations of the six model refinements.
The results are presented in the second part of Table 4. They show that
improvements gained using individual refinement grammars do not combine
additively. However, combinations do increase the improvements: the maximal
improvement gained with pairwise combinations is 15.56%, compared with the
highest improvement of 14.93% gained using individual revisions.

Finally, the results of the experiments with combining all the refinements are
presented in the last four rows of Table 4. Note, however, that revisions of the
T1 and T2 structures (T1-s and T2-s) are mutually exclusive with the respective
revisions of the T1 and T2 constants (T1-c-100 and T2-c-100). Therefore, four
combinations are possible. The one combining the structural revisions of
the T1 and T2 formulas with revisions of the values of the constant parameters
in the formulas for SR_FAS and E gives the maximal improvement in accuracy,
16.19%.

In sum, the presented results of the experiments show that small revisions
of the CASA-NPPc model parameters and structure considerably improve the
accuracy of the model, the maximal improvement being above 16%. However,
Earth science experts should also evaluate the comprehensibility and
acceptability of the revised models. If some of the revisions generate models
that do not make sense from their point of view, new alternative productions
would have to be defined to reflect the experts' comments and to allow only
revisions that lead to acceptable models.

Note that most of the error reduction is gained using a fairly simple
revision operator of changing the values of the constant parameters in the
SR_FAS equation. Only minor additional reductions can be obtained by combining
this revision with any of the other five revision operators described above.
Therefore, this revision would probably be the optimal one from the point of
view of the minimality of change criterion, discussed in Section 4.

6 Conclusions and Discussion 

We have presented a general framework for the revision of theories in the form 
of (sets of) quantitative equations. The method is based on grammars, which 
can be derived from the original theory. Domain experts can focus the revision 
process on parts of the model and guide it by providing relevant alternative 
productions. In this way, the revision process can be interactive, as is quite 
often the case when revising theories expressed in logic. 

We have applied our approach to the problem of revising an existing
equation-based model of the net production of carbon in the Earth ecosystem.
Experimental results show that small revisions in both the values of the
constant parameters and the structure of equations considerably reduce the
error of the model, by about 16% in the best case.

Saito et al. [5] address the same task of revising scientific models in the form 
of equations. Their approach is based on transforming parts of the model into 
a neural network, training the neural network, then transforming the trained 
network back into an expression/equation. This indirect approach is limited to 
revising the parameters or form of one equation in the model at a time. It also 
requires some handcrafting to encode the equations as a neural network: the
authors state that “the need to translate the existing CASA model into a
declarative form that our discovery system can manipulate” is a challenge to
their approach.

Our approach allows for a straightforward representation of existing
scientific models as grammars, which can then be directly manipulated and used to
perform theory revision. The transition from the initial model to a grammar is 
so straightforward that we consider automating this process as one of the topics 
for immediate further work. Revisions to several equations of the original model 
may be considered simultaneously, as illustrated by the experiments performed. 

Whigham and Recknagel [8] also consider the specific task of revising an 
existing model for predicting chlorophyll-a by using measured data. They use a 
genetic algorithm to calibrate the equation parameters. They also use a grammar 
based genetic programming approach to revise the structure of two sub-parts 
(one at a time) of the initial model. A most general grammar that can derive an 
arbitrary expression using the allowed arithmetic operators and functions was 
used for each of the two sub-parts. 

Unlike this paper, Whigham and Recknagel [8] do not present a general 
framework for the revision of quantitative scientific models. Their approach is 
similar to ours in that they use grammars to specify possible revisions. However, 
the grammars they use are too general to provide much information about the 
domain at hand. Also, they do not consider the notion of minimality of revision 
and genetic programming typically produces very large expressions without a 
simplicity bias. 

As already mentioned, an immediate topic for further work is to automate the
grammar generation from the initial model. Another challenge is to provide the
domain experts with an interactive tool for testing out different alternatives
for revision. Furthermore, integrating the minimality of change criterion in
Lagramge is also an open issue. The minimal description length (MDL) heuristic
in Lagramge can be adapted to take into account the distance between the
current and the initial equation model. Finally, we plan to apply the proposed
framework to the task of revising other portions of the CASA model as well as
other equation-based environmental models.



Acknowledgments 

We thank Christopher Potter, Steven Klooster and Alicia Torregrosa from 
NASA-Ames Research Center for making available both the CASA model and 
the relevant data set. 

References 

1. P. Langley, H. A. Simon, G. L. Bradshaw, and J. M. Zytkow. Scientific Discovery.
MIT Press, Cambridge, MA, 1987.

2. N. Lavrac and S. Dzeroski. Inductive Logic Programming: Techniques and
Applications. Ellis Horwood, Chichester, 1994. Freely available at
http://www-ai.ijs.si/SasoDzeroski/ILPBook/.

3. D. Ourston and R. J. Mooney. Theory refinement combining analytical and
empirical methods. Artificial Intelligence, 66:273-309, 1994.







4. C. S. Potter and S. A. Klooster. Interannual variability in soil trace gas (CO2,
N2O, NO) fluxes and analysis of controllers on regional to global scales. Global
Biogeochemical Cycles, 12:621-635, 1998.

5. K. Saito, P. Langley, and T. Grenager. The computational revision of quantitative 
scientific models. 2001. Submitted to Discovery Science conference. 

6. L. Todorovski and S. Dzeroski. Declarative bias in equation discovery. In
Proceedings of the Fourteenth International Conference on Machine Learning, pages
376-384, Nashville, TN, 1997. Morgan Kaufmann.

7. T. Washio and H. Motoda. Discovering admissible models of complex systems
based on scale-types and identity constraints. In Proceedings of the Fifteenth
International Joint Conference on Artificial Intelligence, volume 2, pages 810-817,
Nagoya, Japan, 1997. Morgan Kaufmann.

8. P. A. Whigham and F. Recknagel. Predicting chlorophyll-a in freshwater lakes 
by hybridising process-based models and genetic algorithms. In Book of Abstracts 
of the Second International Conference on Applications of Machine Learning to 
Ecological Modeling. Adelaide University, 2000. 

9. S. Wrobel. First order theory refinement. In L. De Raedt, editor, Advances in
Inductive Logic Programming, pages 14-33. IOS Press, 1996.

10. R. Zembowicz and J. M. Zytkow. Discovery of equations: Experimental evalua- 
tion of convergence. In Proceedings of the Tenth National Conference on Artificial 
Intelligence, pages 70-75, San Jose, CA, 1992. Morgan Kaufmann. 




Simplified Training Algorithms for 
Hierarchical Hidden Markov Models 



Nobuhisa Ueda^{1,2} and Taisuke Sato^{1,2}

^1 Dept. of Computer Science, Tokyo Institute of Technology
^2 CREST, JST

2-12-2 Ookayama Meguro-ku Tokyo Japan 152-8552
ueda@mi.cs.titech.ac.jp, sato@cs.titech.ac.jp



Abstract. We present a simplified EM algorithm and an approximate
algorithm for training hierarchical hidden Markov models (HHMMs),
an extension of hidden Markov models. The EM algorithm we present
is proved to increase the likelihood of training sentences at each
iteration, unlike the existing algorithm called the generalized Baum-Welch
algorithm. The approximate algorithm is applicable to tasks like robot
navigation in which we observe sentences and train parameters
simultaneously. These algorithms and their derivations are simplified by making
use of stochastic context-free grammars.



1 Introduction 

Hidden Markov models (HMMs) are a class of statistical language models to
capture and predict uncertain phenomena from data, and have succeeded in
numerous applications including speech recognition [12] and computational
biology [8]. To describe more complex models, various extensions of HMMs have
been proposed recently such as Input-Output HMMs [3], factorial HMMs [7],
hierarchical HMMs (HHMMs) [6], and maximum entropy Markov models [11].

HHMMs were proposed to efficiently describe global dependencies over
sentences (data) by incorporating hierarchical structures of sentences. To
discover hierarchical structures from data, they have been applied to practical
tasks such as recognition of cursive handwriting [6] and robot navigation [14].
HHMMs, however, have two issues. One is that the efficiency of HHMMs has not
yet been compared to that of stochastic context-free grammars (SCFGs), though
HHMMs are claimed to be a simpler alternative to SCFGs. The other concerns the
Expectation-Maximization (EM) algorithm for HHMMs called the generalized
Baum-Welch algorithm [6]. We found some faults with the algorithm, and fixed
them. But even after fixing them, it did not always increase the likelihood of
training sentences, which should not happen for any EM algorithm. When using
incorrect training algorithms, one might be led to discover false knowledge
from inaccurate parameters.

In this paper, we prove, contrary to the claim about HHMMs in the literature
[6], that HHMMs are efficiently representable with SCFGs. We also derive a
simplified EM algorithm for HHMMs. Thanks to the simplicity of the EM
algorithm, we can also derive an approximate algorithm for training HHMMs.



K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 401-415, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

2 Preliminaries

2.1 Hierarchical Hidden Markov Models 

We review hierarchical hidden Markov models (HHMMs) [6]. The notation we use
differs in part from the original one [6] for notational convenience. Let
$Q = \{q_1, \ldots, q_n\}$ be a set of states, and $\Sigma = \{\sigma_1, \ldots,
\sigma_m\}$ a set of symbols. $O = \{o^1, \ldots, o^v\} \subset \Sigma^+$
denotes a set of training sentences, and $o^u_{s,t}$ ($1 \le u \le v$) a part of
a training sentence $o^u$ from the s-th symbol to the t-th symbol.

An HHMM has three types of state: internal states, production states, and
end states. In an HHMM, there is a unique internal state $q_1$ called the
initial state, and $q_1$ has a “submodel.” The submodel is composed of either
at least one production state, or one end state and at least one internal
state. Every internal state in a submodel has a unique individual submodel
recursively. An internal state $q_i$ and a state $q_j$ are called a parent
state of $q_j$ and a substate of $q_i$, respectively, if a submodel of $q_i$
contains $q_j$. An internal state $q_i$ is said to be a neighbor state of a
state $q_j$ if $q_i$ and $q_j$ are in the same submodel. $q_i^{end}$ denotes
an end state which is a neighbor state of $q_i$.

Recursiveness of submodels forms a tree structure in an HHMM. In the tree,
leaves, nodes, and the root correspond to production states, submodels, and a
submodel containing the initial state, respectively. The depth of a state $q_i$
is defined as the depth of a submodel containing $q_i$ in the tree. Let D denote
the maximum depth of states in an HHMM. Without loss of generality, we assume
$i < j$ if $q_i$ and $q_j$ are at depth d and d' ($d < d'$), respectively.

In each state $q_i$, depending on the type of $q_i$, three types of transition
called vertical transitions, horizontal transitions, and forced transitions can
occur. First suppose we are in an internal state $q_i$. A vertical transition to
$q_j$ occurs if either the previous state is an internal state or this
transition is the first one, where $q_j$ is a substate of $q_i$ except an end
state. The next state $q_j$ is chosen according to vertical transition
probabilities $\pi_{i,j}$. If the previous state is a production state or an end
state, a horizontal transition from $q_i$ to $q_k$ occurs where $q_k$ is a
neighbor state of $q_i$. The next state $q_k$ is selected with horizontal
transition probabilities $a_{i,k}$. Let $a_{i,end}$ denote the probability of a
horizontal transition from $q_i$ to $q_i^{end}$. Second, if we are in an end
state $q_i^{end}$, we move up by a forced transition to the parent state of
$q_i$. Lastly, if we are in a production state $q_{i'}$, we observe a symbol
$\sigma_h$ according to output probabilities $b_{i',h}$, and move up by a forced
transition to the parent state of $q_{i'}$.
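
The three kinds of transition can be made concrete with a small sampler for a partial-transition model. The Python sketch below is an illustration under assumptions: the dictionary layout (`pi` for vertical, `a` for horizontal, `b` for output probabilities), the function names, and the toy model are all hypothetical and do not come from [6] or from the algorithms of this paper.

```python
import random

END = "end"


def pick(distribution, rng):
    """Draw one key of a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]


def emit_from(model, i, rng):
    """Symbols output between entering internal state i and its next horizontal move."""
    j = pick(model[i]["pi"], rng)              # vertical transition to a substate
    if "b" in model[j]:                        # production state: output a symbol,
        return [pick(model[j]["b"], rng)]      # then forced transition back to i
    symbols = []
    while j != END:
        symbols += emit_from(model, j, rng)
        j = pick(model[j]["a"], rng)           # horizontal transition (or to end)
    return symbols                             # forced transition from end up to i


def sample_sentence(model, rng, initial="q1"):
    """Generate one sentence starting from the initial state."""
    return emit_from(model, initial, rng)


# Hypothetical toy model: q1 contains q2 and q3, which own productions p2 and p3.
toy = {
    "q1": {"pi": {"q2": 1.0}, "a": {END: 1.0}},
    "q2": {"pi": {"p2": 1.0}, "a": {"q3": 1.0}},
    "q3": {"pi": {"p3": 1.0}, "a": {END: 1.0}},
    "p2": {"b": {"x": 1.0}},
    "p3": {"b": {"y": 1.0}},
}
sentence = sample_sentence(toy, random.Random(0))
```

Because all probabilities in the toy model are 0 or 1, the sampler always visits q2 and then q3 and the generated sentence is fixed; with non-degenerate probabilities each call would draw a different state sequence.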

For an internal state $q_i$, let sub(i) be the set of indices of substates of
$q_i$ except end states, and fwd(i) that of indices of neighbor states of $q_i$
except end states. For a production state $q_{i'}$, let sym(i') be the set of
indices of symbols that $q_{i'}$ outputs. To make probabilities $\pi_{i,j}$,
$a_{i,k}$, and $b_{i,h}$ consistent with an HHMM, for any internal state $q_i$,
it must hold that
$\sum_{j \in sub(i)} \pi_{i,j} = a_{i,end} + \sum_{k \in fwd(i)} a_{i,k} = 1$,
and $b_{i,h} = 0$ for any h. For any production state $q_{i'}$, it must also
hold that $\pi_{i',j} = a_{i',j} = 0$ for any j, and
$\sum_{h \in sym(i')} b_{i',h} = 1$.
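
These normalization constraints are easy to check mechanically. The following Python sketch does so for a single state; the dictionary layout (`pi`, `a`, `a_end`, `b`) and the function name `is_consistent` are assumptions made for this example only.

```python
def is_consistent(state, tolerance=1e-9):
    """Check the normalization constraints for one state's parameters."""
    if state["kind"] == "internal":
        vertical = sum(state["pi"].values())
        horizontal = state["a_end"] + sum(state["a"].values())
        return (abs(vertical - 1.0) <= tolerance
                and abs(horizontal - 1.0) <= tolerance
                and not state.get("b"))        # internal states output nothing
    # Production state: no vertical or horizontal transitions, outputs sum to 1.
    return (not state.get("pi") and not state.get("a")
            and abs(sum(state["b"].values()) - 1.0) <= tolerance)


internal = {"kind": "internal", "pi": {"q4": 0.5, "q5": 0.5},
            "a": {"q3": 0.7}, "a_end": 0.3}
production = {"kind": "production", "b": {"s1": 0.25, "s2": 0.75}}
broken = {"kind": "internal", "pi": {"q4": 0.5, "q5": 0.5},
          "a": {"q3": 0.7}, "a_end": 0.5}
```

Here `internal` and `production` satisfy the constraints, while `broken` fails because its horizontal probabilities sum to 1.2.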

Fig. 1. Examples of HHMMs. (a) A partial-transition model. (b) A full-transition
model. For any sentence o, the probability that (a) outputs o is equal to the
probability that (b) outputs o. Legend: ○ internal state, □ end state,
△ production state, ( ) submodel; the three arrow styles denote vertical,
horizontal, and forced transitions.

A sentence generated by an HHMM is a sequence of symbols output in a
state sequence from the initial state $q_1$ to $q_1^{end}$. Suppose an HHMM in
Fig. 1 (a) outputs a sentence $\sigma_1\sigma_1\sigma_3\sigma_2\sigma_3$ such
that production states $q_7$, $q_8$, and $q_9$ in the HHMM output $\sigma_1$,
$\sigma_2$, and $\sigma_3$, respectively. Let $q_i(\sigma_j)$ stand for $q_i$
outputting $\sigma_j$, and let $\xrightarrow{v}$, $\xrightarrow{h}$, and
$\xrightarrow{f}$ represent a vertical transition, a horizontal transition,
and a forced transition, respectively. One possible state sequence is the
following:

  $q_1 \xrightarrow{v} q_2 \xrightarrow{v} q_4 \xrightarrow{v} q_7(\sigma_1) \xrightarrow{f} q_4 \xrightarrow{h} q_4 \xrightarrow{v} q_7(\sigma_1) \xrightarrow{f} q_4 \xrightarrow{h} q_4^{end}$
  $\xrightarrow{f} q_2 \xrightarrow{h} q_3 \xrightarrow{v} q_6 \xrightarrow{v} q_9(\sigma_3) \xrightarrow{f} q_6 \xrightarrow{h} q_5 \xrightarrow{v} q_8(\sigma_2) \xrightarrow{f} q_5$
  $\xrightarrow{h} q_6 \xrightarrow{v} q_9(\sigma_3) \xrightarrow{f} q_6 \xrightarrow{h} q_6^{end} \xrightarrow{f} q_3 \xrightarrow{h} q_2^{end} \xrightarrow{f} q_1 \xrightarrow{h} q_1^{end}$.

For horizontal transitions, there are two confusing descriptions in the
original paper [6]. One says, as we have described, that horizontal transitions
are allowed only from internal states.^1 The other says that horizontal
transitions can occur from production states.^2 They are obviously conflicting.
We name them a partial-transition model and a full-transition model for later
reference, respectively. Figure 1 (a) and (b) are examples of a
partial-transition model and a full-transition model, respectively. The
original training algorithm called the generalized Baum-Welch algorithm [6]
seems applicable only to a full-transition model. On the other hand, we adopt a
partial-transition model in order to make the proposed training algorithms and
their derivations simple.

Lastly, for timing analysis, we evaluate the number of states in a
partial-transition model M transformed from a full-transition model M'. One way
to transform M' into M is that, for each production state $q_i$ in M', $q_i$ is
set to an internal state, and to have a unique production state in M. Figure 1
is an example of this transformation. Let $n_{int}$ and $n'_{int}$ be the
numbers of internal states in M and M', respectively, and $n_{pro}$ and
$n'_{pro}$ those of production states in M and M', respectively. We then have
$n_{int} = n'_{int} + n'_{pro}$ and $n_{pro} = n'_{pro}$.

^1 For example, “... an HHMM is characterized by the state transition probability
between the internal state and the output distribution vector of the production
states.” in Sect. 2 of [6].

^2 For an internal state q at depth D - 1, a transition probability matrix
$A^{q^{D-1}} = (a_{ij}^{q^{D-1}})$ is defined in Sect. 2 of [6]. This matrix
contains horizontal transition probabilities
$a_{ij}^{q^{D-1}} = P(q_j^D \mid q_i^D)$ between production states, $q_i^D$ and
$q_j^D$.

2.2 The Generalized Baum-Welch Algorithm

The generalized Baum-Welch algorithm was proposed as an EM algorithm for HHMMs [6], and was analyzed to take $O(vnl^3)$ time to update parameters of an HHMM,† where $v$ is the number of training sentences, $n$ is the number of states, and $l$ is the maximum length of training sentences.

For each training sentence $o_1, \ldots, o_{l'}$, the generalized Baum-Welch algorithm requires various probabilities $\alpha(s,t,q_i,q_k)$, $\beta(s,t,q_i,q_k)$, $\eta_{in}(s,q_i,q_k)$, $\eta_{out}(s,q_i,q_k)$, $\xi(s,q_i,q_j,q_k)$, $\gamma_{in}(s,q_i,q_k)$, $\gamma_{out}(s,q_i,q_k)$, and $\chi(s,q_i,q_k)$, where $1 \le s \le t \le l'$, $q_k$ is an internal state, and $q_i$ and $q_j$ are substates of $q_k$. We discuss here $\alpha(s,t,q_i,q_k)$, $\eta_{out}(s,q_i,q_k)$, and $\xi(s,q_i,q_j,q_k)$. $\alpha(s,t,q_i,q_k)$ is the probability that symbols $o_s \cdots o_t$ are output in any state sequence from $q_k$ to $q_i$, $\eta_{out}(s,q_i,q_k)$ is the probability that a horizontal transition from $q_i$ occurs before $o_{s+1}, \ldots, o_{l'}$ are output, and $\xi(s,q_i,q_j,q_k)$ is the probability that a horizontal transition from $q_i$ to $q_j$ occurs between the output of $o_1, \ldots, o_s$ and that of $o_{s+1}, \ldots, o_{l'}$. Due to space limitations, see the original paper [6] for further details of the other probabilities. From the viewpoint of calculating these probabilities, the generalized Baum-Welch algorithm has the following four drawbacks.

First, the definition of $\eta_{out}(l',q_i,q_k)$ in [6] is incomplete. $\eta_{out}(l',q_i,q_k)$ is recursively defined using $\eta_{out}(l',q_k,q_h)$, where $q_h$ is a parent state of $q_k$. The basis, $\eta_{out}(l',q_{i'},q_1)$, is not defined properly,‡ where $q_{i'}$ is an internal state at depth 2. $\eta_{out}(l',q_i,q_k)$ is required by $\chi(s,q_g,q_i)$, and $\chi(s,q_g,q_i)$ is necessary for updating parameters $\pi_{i,g}$, where $q_g$ is a substate of $q_i$. As a result, parameters are not updated by the generalized Baum-Welch algorithm.

Second, the generalized Baum-Welch algorithm sets the probability $\xi(l',q_{i'},q_{j'},q_1)$ of a logically impossible event in any HHMM to non-zero, where $q_{i'}$ and $q_{j'}$ are internal states at depth 2. $\xi(l',q_{i'},q_{j'},q_1)$ is the probability that a horizontal transition from $q_{i'}$ to $q_{j'}$ occurs after symbols $o_1, \ldots, o_{l'}$ are output. If a horizontal transition from $q_{i'}$ to $q_{j'}$ were possible, a sequence of vertical transitions would continue from $q_{j'}$ to some production state, and then the $(l'+1)$-th symbol of a sentence would be output. This contradicts the fact that the length of the sentence is $l'$.

Third, the likelihood of a sentence $o$ defined in the generalized Baum-Welch algorithm is not equivalent to the probability that an HHMM outputs $o$. The likelihood $P(o|\theta)$ is defined by $\sum_{i' \in sub(1)} \alpha(1,l',q_{i'},q_1)$. $\alpha(1,l',q_{i'},q_1)$ is the probability that an HHMM outputs $o_1, \ldots, o_{l'}$ from $q_1$ to $q_{i'}$, but it contains the probability that an HHMM continues to output $o_{l'+1}$.

† To obtain this bound, $|sub(i)|$ and $|fwd(i)|$ are implicitly assumed to be bounded by some constant for any $i$.

‡ $\eta_{out}(l',q_{i'},q_1)$ is recursively defined using $\eta_{out}(l',q_1,q_h)$, where $q_h$ is a parent state of $q_1$, which never exists in any HHMM.

Fig. 2. Curves of the "log-likelihood" plotted by the generalized Baum-Welch algorithm with modifications: experiments with all combinations of redefinitions (b) and (c) are demonstrated. The dotted lines show the maxima for their curves.



Therefore, to train parameters of HHMMs with the generalized Baum-Welch algorithm, we have to redefine (a) $\eta_{out}(l',q_{i'},q_1) = a_{i',end}$ to update parameters of an HHMM, (b) $\xi(l',q_{i'},q_{j'},q_1) = 0$, and (c) $P(o|\theta) = \sum_{i' \in sub(1)} \alpha(1,l',q_{i'},q_1)\,a_{i',end}$ to make the generalized Baum-Welch algorithm consistent with an HHMM.

Lastly, a proof that the generalized Baum-Welch algorithm is an EM algorithm was not contained in [6]. In addition, the algorithm sometimes showed decreases in the likelihood in our implementation. Figure 2 shows curves of the log-likelihood of 100 training sentences by the generalized Baum-Welch algorithm, given the HHMM in Fig. 1 (b) such that every production state may emit any symbol. The sentences were generated randomly with the HHMM in Fig. 1 (b) such that production states $q_4$, $q_5$, and $q_6$ in the HHMM output $\sigma_1$, $\sigma_2$, and $\sigma_3$, respectively. This result clearly contradicts the fundamental property of EM algorithms, in which each iteration is guaranteed not to decrease the log-likelihood [5]. Though we have fixed several drawbacks in the generalized Baum-Welch algorithm, it seems futile to try to turn our implementation into an EM algorithm without a proof that the generalized Baum-Welch algorithm is an EM algorithm. We will therefore derive a new EM algorithm for HHMMs.

3 Training Algorithms 

3.1 Description with Stochastic Context-Free Grammars 

Before presenting training algorithms, we show that given an HHMM $M$, we are able to construct a stochastic context-free grammar (SCFG) $G$ such that the set of sentences from $M$ is equivalent to that of sentences from $G$. Due to space limitations, we refer the reader to, e.g., [4] for definitions of SCFGs.

For notational convenience, let a subsentence of $q_i$ be the symbols generated in a state sequence from $q_i$ to either an end state $q_{end}$ or the parent state of $q_i$. $Q_{int}$, $Q_{pro}$, and $Q_{end}$ are the set of internal states, that of production states, and that of internal states $q_i$ such that a horizontal transition from $q_i$ to $q_{end}$ is available, respectively.

For a production state $q_{i'}$, recall that $sym(i')$ is the set of indices of symbols that $q_{i'}$ outputs. $A_{i'}$, a non-terminal of an SCFG, is able to generate any subsentence of $q_{i'}$ if the set of rules of the SCFG contains $A_{i'} \to \sigma_h$ for any $h \in sym(i')$. For an internal state $q_i$, we consider a non-terminal $A_i$. Any subsentence of $q_i$ is equivalent to either $w$ or $ww'$, where $w$ is a subsentence of a state $q_j$ in a submodel of $q_i$, $w'$ is a subsentence of a neighbor state $q_k$ of $q_i$, and $ww'$ is the concatenation of $w$ and $w'$. Roughly speaking, $A_i$ is able to generate any subsentence of $q_i$ if the set of rules contains $A_i \to A_jA_k$ and $A_i \to A_j$ for any $j \in sub(i)$ and $k \in fwd(i)$.

From these observations, we can conclude that given an HHMM $M$, there exists an SCFG $G = (N, \Sigma, R, A_1)$ which is able to derive every sentence generated by $M$, where $N = \{A_1, \ldots, A_n\}$ and

$$
\begin{aligned}
R = {} & \{A_i \to A_jA_k \mid q_i \in Q_{int},\ j \in sub(i),\ k \in fwd(i)\} \\
& \cup \{A_i \to A_j \mid q_i \in Q_{end},\ j \in sub(i)\} \cup \{A_i \to \sigma_h \mid q_i \in Q_{pro},\ h \in sym(i)\}.
\end{aligned}
$$

We then formally show that the set of sentences from $G$ is equivalent to that from $M$. Let $L_M(q_i,l)$ be the set of subsentences $w$ of $q_i$ such that the length of $w$ is at most $l$, and $L_G(A_i,l)$ be the set of strings $w \in \Sigma^+$ such that $A_i \Rightarrow^* w$ and $|w| \le l$. For any $q_i$ and $A_i$, we set $L_M(q_i,0) = L_G(A_i,0) = \emptyset$. For any pair of $M$ and $G$, the following holds.

Lemma 1. Let $q_i$ be an internal state, and $l \ge 1$. Suppose $L_M(q_j,l) = L_G(A_j,l)$ and $L_M(q_k,l-1) = L_G(A_k,l-1)$ for any $q_j$ and $q_k$ such that $q_j$ is a substate of $q_i$, and $q_k$ is a neighbor state of $q_i$. It then holds that $L_M(q_i,l) = L_G(A_i,l)$.

Proposition 1. For an HHMM $M$ and an SCFG $G$ constructed as above, let $L_M(q_i)$ be the set of subsentences of $q_i$, and $L_G(A_i)$ the set of subsentences derived from $A_i$. For any production state or internal state $q_i$, $L_M(q_i) = L_G(A_i)$.

They are proved in the appendices.

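For example, from the HHMM in Fig. 1 (a), we construct a set of rules $R_{ex}$:

$$
R_{ex} = \left\{
\begin{array}{ll}
A_1 \to A_2 \mid A_3, & A_2 \to A_4A_2 \mid A_4A_3, \\
A_3 \to A_5A_2 \mid A_5A_3 \mid A_6A_2 \mid A_6A_3 \mid A_5 \mid A_6, & A_4 \to A_7A_4 \mid A_7, \\
A_5 \to A_8A_5 \mid A_8A_6 \mid A_8, & A_6 \to A_9A_5 \mid A_9A_6 \mid A_9, \\
A_7 \to \sigma_1, \quad A_8 \to \sigma_2, & A_9 \to \sigma_3
\end{array}
\right\},
$$

where $A_i \to \alpha_1 \mid \cdots \mid \alpha_k$ is an abbreviation for $A_i \to \alpha_1, \ldots, A_i \to \alpha_k$. With $R_{ex}$, the start symbol $A_1$ is able to derive the sentence $\sigma_1\sigma_1\sigma_3\sigma_2\sigma_3$ as follows:

$$
\begin{aligned}
A_1 &\to A_2 \to A_4A_3 \to A_7A_4A_3 \to \sigma_1A_4A_3 \\
&\to \sigma_1A_7A_3 \to \sigma_1\sigma_1A_3 \to \sigma_1\sigma_1A_6 \to \sigma_1\sigma_1A_9A_5 \to \sigma_1\sigma_1\sigma_3A_5 \\
&\to \sigma_1\sigma_1\sigma_3A_8A_6 \to \sigma_1\sigma_1\sigma_3\sigma_2A_6 \to \sigma_1\sigma_1\sigma_3\sigma_2A_9 \to \sigma_1\sigma_1\sigma_3\sigma_2\sigma_3.
\end{aligned}
$$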
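The derivation above can be checked mechanically. The following is a minimal sketch, not part of the original paper: it encodes the example rule set as a Python dictionary (the names `RULES` and `derives`, and the terminal spellings `"s1"`, `"s2"`, `"s3"` for $\sigma_1, \sigma_2, \sigma_3$, are assumptions made for illustration) and tests by memoized recursion whether a non-terminal derives a given sentence.

```python
from functools import lru_cache

# Example rules: non-terminal index -> list of right-hand sides.
# A right-hand side is either a tuple of non-terminal indices or a terminal string.
RULES = {
    1: [(2,), (3,)],
    2: [(4, 2), (4, 3)],
    3: [(5, 2), (5, 3), (6, 2), (6, 3), (5,), (6,)],
    4: [(7, 4), (7,)],
    5: [(8, 5), (8, 6), (8,)],
    6: [(9, 5), (9, 6), (9,)],
    7: ["s1"],
    8: ["s2"],
    9: ["s3"],
}


def derives(symbol, sentence):
    """Return True if non-terminal A_symbol derives the whole sentence."""
    sentence = tuple(sentence)

    @lru_cache(maxsize=None)
    def go(a, s, t):
        # True if A_a derives sentence[s:t]; the span is always non-empty.
        for rhs in RULES[a]:
            if isinstance(rhs, str):          # A_a -> terminal
                if t - s == 1 and sentence[s] == rhs:
                    return True
            elif len(rhs) == 1:               # A_a -> A_j (unit rules are acyclic here)
                if go(rhs[0], s, t):
                    return True
            else:                             # A_a -> A_j A_k, try every split point
                j, k = rhs
                if any(go(j, s, r) and go(k, r, t) for r in range(s + 1, t)):
                    return True
        return False

    return len(sentence) > 0 and go(symbol, 0, len(sentence))
```

Calling `derives(1, ["s1", "s1", "s3", "s2", "s3"])` asks whether the start symbol derives the sentence of the worked derivation. The recursion terminates because binary rules split the span into strictly shorter parts and the unit rules of this rule set contain no cycle.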

Several training algorithms for SCFGs, such as the Inside-Outside algorithm [1,10] and Stolcke's algorithm [13], are known. They, however, are not usable for training an SCFG which represents an HHMM. One reason is that the parameters of usual SCFGs differ from those of SCFGs describing HHMMs. In a usual SCFG, the probability of each rule is described by one parameter. On the other hand, in an SCFG describing an HHMM, we have $P(A_i \to A_jA_k|\theta) = \pi_{i,j}a_{i,k}$, $P(A_i \to A_j|\theta) = \pi_{i,j}a_{i,end}$, and $P(A_i \to \sigma_h|\theta) = b_{i,h}$, where $\theta$ represents the set of parameters. That is, the probabilities of several rules consist of two parameters. Hence, we need to derive new training algorithms for SCFGs describing HHMMs.

3.2 An EM Algorithm 

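For any SCFG describing an HHMM, the following holds similarly to usual SCFGs [9].

Proposition 2. For an SCFG which describes an HHMM, it holds that $P(O|\theta') \ge P(O|\theta)$ if any parameter $\gamma_{i,j}$ for a non-terminal $A_i$ is updated by

$$
\gamma'_{i,j} = \frac{1}{Z_i} \sum_{u=1}^{v} \frac{\gamma_{i,j}}{P(o^u|\theta)} \frac{\partial P(o^u|\theta)}{\partial \gamma_{i,j}},
$$

where $P(O|\theta) = \prod_{u=1}^{v} P(o^u|\theta)$, $Z_i$ is a normalizing constant of $q_i$, $v$ is the number of training sentences, and $\theta'$ is the set of updated parameters.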

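Proposition 2 is proved in the appendix. From this, we are able to update parameters if we find the partial derivatives $\partial P(o^u|\theta)/\partial \gamma_{i,j}$, the likelihood $P(o^u|\theta)$ of sentence $o^u$, and $Z_i$ for any $i$, $j$, and $u$. For brevity, only parameters $\pi_{i,j}$ of vertical transitions will be considered here, but the derivations for the other parameters are analogous.

First, find $\partial P(o^u|\theta)/\partial \pi_{i,j}$. For notational convenience, let $l^u$ be the length of the $u$-th sentence, $f_u(s,t,i) = P(A_1 \Rightarrow^* o^u_1 \cdots o^u_{s-1}\,A_i\,o^u_{t+1} \cdots o^u_{l^u} \mid \theta)$, and $e_u(s,t,i) = P(A_i \Rightarrow^* o^u_s \cdots o^u_t \mid \theta)$. Then

$$
\begin{aligned}
\frac{\partial P(o^u|\theta)}{\partial \pi_{i,j}}
&= \frac{\partial}{\partial \pi_{i,j}} \sum_{\tau \in T(u,\pi_{i,j})} P(o^u,\tau|\theta) \\
&= \sum_{s=1}^{l^u} \sum_{t=s}^{l^u} \sum_{\tau \in T(u,\pi_{i,j},s,t)} \frac{P(o^u,\tau|\theta)}{\pi_{i,j}} \\
&= \sum_{s=1}^{l^u} \sum_{t=s}^{l^u} f_u(s,t,i)\,e_u(s,t,j)\,a_{i,end} \\
&\quad + \sum_{s=1}^{l^u} \sum_{t=s+1}^{l^u} \sum_{k \in fwd(i)} \sum_{r=s}^{t-1} f_u(s,t,i)\,e_u(s,r,j)\,a_{i,k}\,e_u(r+1,t,k),
\end{aligned}
$$

where $T(u,\pi_{i,j})$ is the set of possible parse trees $\tau$ for $o^u$ such that at least one rule with $\pi_{i,j}$ occurs in $\tau$, i.e., $\tau$ contains $A_i \to A_j$ or $A_i \to A_jA_k$ for some $k$, and $T(u,\pi_{i,j},s,t)$ is the set of parse trees $\tau$ for $o^u$ such that $A_i \Rightarrow^* o^u_s \cdots o^u_t$ and this $A_i$ is directly expanded by a rule with $\pi_{i,j}$ in $\tau$. The second line is obtained using a formula for differentiation, and the third line is obtained by the context-free assumption. From this, $e_u(s,t,i)$ and $f_u(s,t,i)$ are required to calculate the partial derivatives.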

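We can find $e_u(s,t,i)$ and $f_u(s,t,i)$ recursively in a manner similar to the calculation of the inner probabilities and the outer probabilities in the Inside-Outside algorithm [1,10]. We calculate the inner probability $e_u(s,t,i)$ by

$$
e_u(s,t,i) =
\begin{cases}
b_{i,h} & s = t, \text{ where } o_s = \sigma_h, \\
0 & s < t,
\end{cases}
$$

for a production state $q_i$, and by

$$
e_u(s,t,i) =
\begin{cases}
\displaystyle \sum_{j \in sub(i)} \pi_{i,j}\,e_u(s,s,j)\,a_{i,end} & s = t, \\[2ex]
\displaystyle \sum_{r=s}^{t-1} \Bigl( \sum_{j \in sub(i)} \pi_{i,j}\,e_u(s,r,j) \Bigr) \Bigl( \sum_{k \in fwd(i)} a_{i,k}\,e_u(r+1,t,k) \Bigr)
+ \sum_{j \in sub(i)} \pi_{i,j}\,e_u(s,t,j)\,a_{i,end} & s < t,
\end{cases}
$$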

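for an internal state $q_i$. We also define the outer probability $f_u(s,t,i)$ as

$$
f_u(s,t,i) =
\begin{cases}
1 & s = 1,\ t = l^u,\ i = 1, \\
0 & s \ne t,\ q_i \in Q_{pro}, \\[1ex]
\displaystyle \sum_{j \in bwd(i)} \sum_{k \in sub(j)} \sum_{r=1}^{s-1} f_u(r,t,j)\,\pi_{j,k}\,e_u(r,s-1,k)\,a_{j,i} \\
\displaystyle \quad + \sum_{j \in prt(i)} \sum_{k \in fwd(j)} \sum_{r=t+1}^{l^u} f_u(s,r,j)\,\pi_{j,i}\,e_u(t+1,r,k)\,a_{j,k} \\
\displaystyle \quad + \sum_{j \in prt(i)} f_u(s,t,j)\,\pi_{j,i}\,a_{j,end} & \text{otherwise},
\end{cases}
$$

where $prt(i) = \{j \mid i \in sub(j)\}$, and $bwd(i) = \{k \mid i \in fwd(k)\}$.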

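Second, the likelihood $P(o^u|\theta)$ is equivalent to $P(A_1 \Rightarrow^* o^u|\theta) = e_u(1,l^u,1)$. Lastly, it holds that $a'_{i,end} + \sum_{k \in fwd(i)} a'_{i,k} = 1$ if we set $Z_{u,i} = \sum_{s=1}^{l^u} \sum_{t=s}^{l^u} f_u(s,t,i)\,e_u(s,t,i)$ and $Z_i = \sum_{u=1}^{v} Z_{u,i}/P(o^u|\theta)$ for $q_i \in Q_{int}$.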
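The inner-probability recursion, and the likelihood $e_u(1,l^u,1)$ it yields, can be sketched in a few lines. The block below is only an illustration under stated assumptions: the model has the shape of the example rule set built from Fig. 1 (a), all parameters are set uniformly (the paper does not give parameter values), and the names `SUB`, `FWD`, `HAS_END`, `EMIT`, `uniform_parameters`, `make_inner`, and `likelihood` are hypothetical.

```python
from functools import lru_cache

# Hypothetical model shaped like the example rules: substates, neighbor states,
# whether a transition to the end state exists, and production-state emissions.
SUB = {1: [2, 3], 2: [4], 3: [5, 6], 4: [7], 5: [8], 6: [9]}
FWD = {1: [], 2: [2, 3], 3: [2, 3], 4: [4], 5: [5, 6], 6: [5, 6]}
HAS_END = {1: True, 2: False, 3: True, 4: True, 5: True, 6: True}
EMIT = {7: {"s1": 1.0}, 8: {"s2": 1.0}, 9: {"s3": 1.0}}


def uniform_parameters():
    """Assumed parameter values: uniform pi, and uniform a over neighbors plus end."""
    pi, a, a_end = {}, {}, {}
    for i in SUB:
        pi[i] = {j: 1.0 / len(SUB[i]) for j in SUB[i]}
        choices = len(FWD[i]) + (1 if HAS_END[i] else 0)
        a[i] = {k: 1.0 / choices for k in FWD[i]}
        a_end[i] = 1.0 / choices if HAS_END[i] else 0.0
    return pi, a, a_end


def make_inner(sentence, pi, a, a_end, emit=EMIT):
    """Return e(s, t, i), the inner probability with 1-based inclusive positions."""
    sentence = tuple(sentence)

    @lru_cache(maxsize=None)
    def e(s, t, i):
        if i in emit:  # production state: emits exactly one symbol
            return emit[i].get(sentence[s - 1], 0.0) if s == t else 0.0
        # Term for the rule A_i -> A_j, weighted by pi[i][j] * a_end[i].
        total = sum(p * e(s, t, j) for j, p in pi[i].items()) * a_end[i]
        # Terms for the rules A_i -> A_j A_k, summed over every split point r.
        for r in range(s, t):
            left = sum(p * e(s, r, j) for j, p in pi[i].items())
            right = sum(q * e(r + 1, t, k) for k, q in a[i].items())
            total += left * right
        return total

    return e


def likelihood(sentence):
    """Likelihood of a sentence as e(1, l, 1) under the uniform parameters."""
    pi, a, a_end = uniform_parameters()
    return make_inner(sentence, pi, a, a_end)(1, len(sentence), 1)
```

Memoization plays the role of the table filled by the nested loops over span length, start position, and state; each entry is computed once, so the cost is cubic in the sentence length when `sub` and `fwd` have bounded size.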

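Likewise, the other partial derivatives are defined as

$$
\begin{aligned}
\frac{\partial P(o^u|\theta)}{\partial a_{i,k}} &= \sum_{j \in sub(i)} \pi_{i,j} \sum_{s=1}^{l^u} \sum_{t=s+1}^{l^u} \sum_{r=s}^{t-1} f_u(s,t,i)\,e_u(s,r,j)\,e_u(r+1,t,k), \\
\frac{\partial P(o^u|\theta)}{\partial a_{i,end}} &= \sum_{s=1}^{l^u} \sum_{t=s}^{l^u} f_u(s,t,i) \sum_{j \in sub(i)} \pi_{i,j}\,e_u(s,t,j), \\
\frac{\partial P(o^u|\theta)}{\partial b_{i',h}} &= \sum_{s\,:\,o^u_s = \sigma_h} f_u(s,s,i'),
\end{aligned}
$$

where the last equation follows from $e_u(s,s,i') = b_{i',h}$ for $o^u_s = \sigma_h$, and normalizing constants for $q_{i'} \in Q_{pro}$ are defined as $Z_{i'} = \sum_{u=1}^{v} Z_{u,i'}/P(o^u|\theta)$, where $Z_{u,i'} = \sum_{s=1}^{l^u} f_u(s,s,i')\,e_u(s,s,i')$. By updating parameters according to the above equations, the likelihood of the training sentences is guaranteed not to decrease, by Proposition 2.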

As a summary, Fig. 3 shows pseudo-code for this EM algorithm. It requires $O(v(n_{int}l^3 + n_{pro}l^2))$ time to update parameters of an HHMM, i.e., to calculate the inner probabilities, the outer probabilities, the partial derivatives, and the normalizing constants, where $n_{int} = |Q_{int}|$, $n_{pro} = |Q_{pro}|$, and $|sub(i)|$ and $|fwd(i)|$ are bounded by some constant. When a full-transition model is given, $O(vnl^3)$ time is sufficient to update parameters by this algorithm.† This time bound is as efficient as that of the generalized Baum-Welch algorithm, and it turns out that HHMMs are efficiently representable with SCFGs.

    t := 0; P(O|θ^(0)) := -∞;
    for all γ_ij ∈ θ^(0) do
        initialize γ_ij randomly s.t. Σ_j γ_ij = 1 for any i;
    repeat
        t := t + 1;
        for u := 1 to v do
            for ts := 0 to l^u - 1 do
                for s := 1 to l^u - ts do
                    for i := n downto 1 do
                        find e_u(s, s + ts, i);    /* inner probabilities */
            for ts := l^u - 1 downto 0 do
                for s := 1 to l^u - ts do
                    for i := 1 to n do
                        find f_u(s, s + ts, i);    /* outer probabilities */
            for i := 1 to n do
                find Z_{u,i};    /* normalizing constants */
            for all γ_ij ∈ θ^(t-1) do find ∂P(o^u|θ)/∂γ_ij;
        for all γ_ij ∈ θ^(t-1) do update γ_ij;
    until P(O|θ^(t)) - P(O|θ^(t-1)) < ε;
    output θ^(t)

Fig. 3. An EM algorithm for hierarchical hidden Markov models. θ^(t) stands for the set of parameters at the t-th iteration.

† We have $v(n_{int}l^3 + n_{pro}l^2) = v((n'_{int} + n'_{pro})l^3 + n'_{pro}l^2) \le 2v(n'_{int} + n'_{pro})l^3 \le 2vnl^3$, where $n'_{int}$ and $n'_{pro}$ are the numbers of internal states and of production states in a full-transition model.

3.3 An Approximate Algorithm



An approximate algorithm for training HHMMs was reported to take 0{vnl^) 
time to update parameters [6], where v is the number of training sentences, n 
is the number of states, and I is the maximum length of the training sentences. 
Unfortunately, it was neither theoretically validated nor explained in detail in [6] . 

We present another approximate algorithm, which takes $O(nl^3)$ time to update parameters. The idea of this algorithm is that at each iteration it selects several sentences from the training sentences and increases the log-likelihood of the selected sentences. Let $O(t) = \{o^{t,1}, \ldots, o^{t,v'}\} \subseteq O$ be the set of sentences selected at the $t$-th iteration, where $v'$ is bounded by some constant. Choosing sentences makes the time for updating parameters independent of the number of training sentences $v$. In addition, this algorithm does not need all training sentences to be prepared before training starts, so it can observe sentences and train parameters simultaneously in practical tasks such as robot navigation [14].

This algorithm is guaranteed to increase the likelihood of the selected training sentences at each iteration, but it may decrease that of all training sentences. To avoid this drawback, the approximate algorithm can be combined with the EM algorithm. That is, the approximate algorithm roughly but efficiently estimates parameters in early stages, and then the EM algorithm tries to maximize the likelihood of all training sentences. This combination is called the hybrid algorithm.

In the hybrid algorithm, it is not clear when the EM algorithm should replace the approximate algorithm, since the approximate algorithm does not compute the likelihood of the whole set of training sentences. We therefore set thresholds $\epsilon_{hybrid}$ and $t_{hybrid}$, count the number of iterations $t'$ at which the increase of the log-likelihood of selected sentences by the approximate algorithm is less than $\epsilon_{hybrid}$, and heuristically switch from the approximate algorithm to the EM one when $t'$ reaches $t_{hybrid}$.
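The switching heuristic of the hybrid algorithm can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `approx_step` and `em_step` are hypothetical callables standing in for one update of the approximate algorithm (returning the increase of the log-likelihood of the selected sentences) and one EM update, and `hybrid_schedule` is an assumed name.

```python
def hybrid_schedule(approx_step, em_step, max_iters, eps_hybrid, t_hybrid):
    """Run approximate updates, then switch to EM updates for good.

    approx_step() performs one approximate update and returns the increase of
    the log-likelihood of the selected sentences; em_step() performs one EM
    update. The list of phases that were executed is returned for inspection.
    """
    small_gain_count = 0   # the counter t' of the text
    use_em = False
    phases = []
    for _ in range(max_iters):
        if use_em:
            em_step()
            phases.append("em")
            continue
        gain = approx_step()
        phases.append("approx")
        if gain < eps_hybrid:
            small_gain_count += 1
        if small_gain_count >= t_hybrid:
            use_em = True   # every later iteration uses the EM algorithm
    return phases
```

Counting small gains rather than testing a single iteration keeps one noisy update on an unusual sentence from triggering the switch too early.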

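The approximate algorithm for HHMMs is based on the smooth on-line learning algorithm for HMMs [2], since it is one of the simplest algorithms for training stochastic language models as far as we are aware. Weights $w^{\pi}_{i,j}$, $w^{a}_{i,j}$, and $w^{b}_{i,h}$ are introduced for the normalized-exponential representation [2], and the parameters of an HHMM are defined as

$$
\begin{aligned}
\pi_{i,j} &= \frac{\exp(\lambda w^{\pi}_{i,j})}{\sum_{k \in sub(i)} \exp(\lambda w^{\pi}_{i,k})}, &
a_{i,j} &= \frac{\exp(\lambda w^{a}_{i,j})}{\exp(\lambda w^{a}_{i,end}) + \sum_{k \in fwd(i)} \exp(\lambda w^{a}_{i,k})}, \\
a_{i,end} &= \frac{\exp(\lambda w^{a}_{i,end})}{\exp(\lambda w^{a}_{i,end}) + \sum_{j \in fwd(i)} \exp(\lambda w^{a}_{i,j})}, &
b_{i,h} &= \frac{\exp(\lambda w^{b}_{i,h})}{\sum_{k \in sym(i)} \exp(\lambda w^{b}_{i,k})},
\end{aligned}
$$

where $\lambda$ is a positive constant.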
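The normalized-exponential representation maps unconstrained weights to a probability distribution. A minimal sketch follows; the function name `normalized_exponential` and the dictionary layout are assumptions, and subtracting the maximum before exponentiating is a standard numerical safeguard that is not mentioned in the text and does not change the result.

```python
import math


def normalized_exponential(weights, lam):
    """Map a dict of weights to probabilities proportional to exp(lam * w).

    For horizontal transitions the dict holds one weight per neighbor state
    plus one for the end state, so that a_{i,end} and all a_{i,k} sum to one.
    """
    shift = max(lam * w for w in weights.values())
    exps = {key: math.exp(lam * w - shift) for key, w in weights.items()}
    total = sum(exps.values())
    return {key: value / total for key, value in exps.items()}
```

For example, `normalized_exponential({"end": 0.0, 2: 0.0, 3: 0.0}, 0.1)` assigns the same probability to the end state and to the two neighbor states, since equal weights give a uniform distribution for any positive `lam`.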

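Let $\theta$ denote the set of parameters at the $t$-th iteration, and $\theta'$ the set of updated parameters. Suppose $\theta'$ is sufficiently close to $\theta$. We approximate the log-likelihood of the selected sentences $O(t)$ by a first-order Taylor expansion around $\theta$:

$$
\log P(O(t)|\theta') \approx \log P(O(t)|\theta) + \sum_{i,j} \bigl(w'^{\gamma}_{i,j} - w^{\gamma}_{i,j}\bigr) \frac{\partial \log P(O(t)|\theta)}{\partial w^{\gamma}_{i,j}},
$$

where $w^{\gamma}_{i,j}$ is the weight for a parameter $\gamma_{i,j}$ of an HHMM. $P(O(t)|\theta') \ge P(O(t)|\theta)$ holds if $\eta$ is a small positive number and $w'^{\gamma}_{i,j} = w^{\gamma}_{i,j} + \eta\,\partial \log P(O(t)|\theta)/\partial w^{\gamma}_{i,j}$. We sketch how to calculate $\partial \log P(O(t)|\theta)/\partial w^{\pi}_{i,j}$ (the other partial derivatives are found in a similar way).

$$
\begin{aligned}
\frac{\partial \log P(O(t)|\theta)}{\partial w^{\pi}_{i,j}}
&= \sum_{k \in sub(i)} \sum_{u=1}^{v'} \frac{1}{P(o^{t,u}|\theta)} \frac{\partial P(o^{t,u}|\theta)}{\partial \pi_{i,k}} \frac{\partial \pi_{i,k}}{\partial w^{\pi}_{i,j}} \\
&= \lambda \pi_{i,j}(1 - \pi_{i,j}) \sum_{u=1}^{v'} \frac{1}{P(o^{t,u}|\theta)} \frac{\partial P(o^{t,u}|\theta)}{\partial \pi_{i,j}}
 - \sum_{k \in sub(i) \setminus \{j\}} \lambda \pi_{i,j}\pi_{i,k} \sum_{u=1}^{v'} \frac{1}{P(o^{t,u}|\theta)} \frac{\partial P(o^{t,u}|\theta)}{\partial \pi_{i,k}} \\
&= \lambda \pi_{i,j} \sum_{u=1}^{v'} \frac{1}{P(o^{t,u}|\theta)} \Bigl( \frac{\partial P(o^{t,u}|\theta)}{\partial \pi_{i,j}} - \sum_{k \in sub(i)} \pi_{i,k} \frac{\partial P(o^{t,u}|\theta)}{\partial \pi_{i,k}} \Bigr) \\
&= \lambda \pi_{i,j} \sum_{u=1}^{v'} \frac{1}{P(o^{t,u}|\theta)} \Bigl( \frac{\partial P(o^{t,u}|\theta)}{\partial \pi_{i,j}} - Z_{u,i} \Bigr),
\end{aligned}
$$

where we obtain the first line by the chain rule. Likewise, for the other weights $w^{\gamma}_{i,j}$, we set

$$
\frac{\partial \log P(O(t)|\theta)}{\partial w^{\gamma}_{i,j}} = \lambda \gamma_{i,j} \sum_{u=1}^{v'} \frac{1}{P(o^{t,u}|\theta)} \Bigl( \frac{\partial P(o^{t,u}|\theta)}{\partial \gamma_{i,j}} - Z_{u,i} \Bigr)
$$

such that $\partial P(o^{t,u}|\theta)/\partial \gamma_{i,j}$ and $Z_{u,i}$ are already defined in the EM algorithm. By the approximate training algorithm, it only takes $O(n_{int}l^3 + n_{pro}l^2)$ time to update parameters of an HHMM.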


4 Experiment 

In this section, we show an experimental result with the proposed algorithms. To compare them with the generalized Baum-Welch algorithm, we use the same data set as in the experiment in Sect. 2.2, which consists of 100 sentences generated randomly with the HHMM in Fig. 1 (b). For each algorithm, a set of rules $R'$ is given in which every production state may emit every symbol, i.e., $R'$ contains $A_i \to \sigma_h$ for every $q_i \in Q_{pro}$ and $1 \le h \le 3$.

With the EM algorithm, parameters $\pi_{i,j}$, $a_{i,j}$, and $b_{i,h}$ are initialized randomly. With the approximate algorithm and the hybrid algorithm, weights $w^{\pi}_{i,j}$, $w^{a}_{i,j}$, and $w^{b}_{i,h}$ are initialized randomly, and we put $\eta = 1.0$ ($t \le 10$), $\eta = 10/t$ ($t > 10$), and $\lambda = 0.1$, where $t$ is the number of iterations. For updating parameters, the EM algorithm uses all sentences, and the approximate algorithm selects sentences one by one. In the hybrid algorithm, we set $\epsilon_{hybrid} = 0.005$ and $t_{hybrid} = 20$.

Fig. 4. Curves of the log-likelihood: (a) the log-likelihood of training sentences over iterations, (b) the log-likelihood of training sentences over time for updating parameters.

For each algorithm, training was carried out 100 times, where only the initial parameters (the EM algorithm) or weights (the approximate algorithm and the hybrid one) differed between runs. The programs were run on a Pentium II 450 MHz machine with FreeBSD. Experimental results, averages of the log-likelihood of 100 sentences over iterations and those over time,† are shown in Fig. 4.

In (a), unlike the generalized Baum-Welch algorithm, the EM algorithm always increased the log-likelihood, as we have proved. In (b), the training curve of the hybrid algorithm increased fastest. In the hybrid algorithm, the jump of the log-likelihood is caused by changing from increasing the log-likelihood to maximizing the auxiliary function $Q(\theta,\theta')$. This result is not general, but suggests that rough estimation by the approximate algorithm provides a set of plausible initial parameters for the EM algorithm and reduces the time for training.

5 Conclusion

We have derived a simplified EM algorithm and an approximate algorithm for training hierarchical hidden Markov models (HHMMs). We have also shown that HHMMs are efficiently representable with stochastic context-free grammars.

We need to carry out more experiments to measure the performance of our algorithms. We are currently planning to apply these algorithms to practical domains.

† We did not compare the log-likelihood of the generalized Baum-Welch algorithm with those of the proposed algorithms over time, since the generalized Baum-Welch algorithm does not estimate parameters correctly, as we have seen in Sect. 2.2.

A: Proof of Lemma 1

Let $q_i$ be an arbitrary internal state, and fix $l \ge 1$. $q_j$ denotes a substate of $q_i$, and $q_k$ a neighbor state of $q_i$. Suppose (a) $L_M(q_j,l) = L_G(A_j,l)$ and (b) $L_M(q_k,l-1) = L_G(A_k,l-1)$ for any $q_j$ and any $q_k$.

We first show that $w_i \in L_G(A_i,l)$ if $w_i \in L_M(q_i,l)$ for any $w_i$. Let $w_i$ be a subsentence of $q_i$ such that $|w_i| \le l$. For $w_i$, two cases are distinguished by horizontal transitions from $q_i$. One is when we move from $q_i$ to $q_{end}$ by a horizontal transition. It then holds that $w_i = w_j$ for some $w_j \in L_M(q_j,l)$. In addition, a horizontal transition from $q_i$ to $q_{end}$ is available iff $(A_i \to A_j) \in R$. By the assumption (a), it also holds that $A_i \Rightarrow A_j \Rightarrow^* w_j = w_i$.

The other is when we move from $q_i$ to $q_k$ by a horizontal transition. It then holds that $w_i = w_jw_k$ for some $w_j \in L_M(q_j,l)$ and $w_k \in L_M(q_k,l-1)$. In addition, a horizontal transition from $q_i$ to $q_k$ is available iff $(A_i \to A_jA_k) \in R$. By the assumptions (a) and (b), it also holds that $A_i \Rightarrow A_jA_k \Rightarrow^* w_jw_k = w_i$. We therefore obtain that $w_i \in L_G(A_i,l)$ if $w_i \in L_M(q_i,l)$ for any $w_i$.

For the converse, it can be shown by similar arguments that $w_i \in L_M(q_i,l)$ if $w_i \in L_G(A_i,l)$ for any $w_i$. □



B: Proof of Proposition 1 

The proof involves a double induction. We wish to prove $L_M(q_i,l) = L_G(A_i,l)$ by backward induction on the depth $d$ of $q_i$. But in order to prove the result for $d$ given the result for $d+1$, we must also do an induction on $l$.

First, let $q_{i'}$ be a state at depth $D$. Since $q_{i'}$ does not have any submodel, $q_{i'}$ must be a production state. For $l = 1$, we obtain $L_M(q_{i'},1) = L_G(A_{i'},1)$ since $h \in sym(i')$ iff $(A_{i'} \to \sigma_h) \in R$. We then assume $l \ge 2$. Since a forced transition from $q_{i'}$ to its parent state occurs immediately after only one symbol is output, it holds that $L_M(q_{i'},l) = L_M(q_{i'},1)$. On the other hand, for $A_{i'}$, $R$ contains only rules $A_{i'} \to \sigma$ such that $\sigma \in \Sigma$. This implies that $L_G(A_{i'},l) = L_G(A_{i'},1)$. From $L_M(q_{i'},1) = L_G(A_{i'},1)$, we therefore obtain $L_M(q_{i'},l) = L_G(A_{i'},l)$ for any state $q_{i'}$ at depth $D$ and any $l$.

Second, let $q_i$ be a state at depth $d < D$. If $q_i$ is a production state, the discussion for a state at depth $D$ is directly applicable to $q_i$. We then only have to consider the case that $q_i$ is an internal state. The hypothesis for the induction on $d$ is that $L_M(q_j,l) = L_G(A_j,l)$ for any state $q_j$ at depth $d+1$ and any $l$. It holds that $L_M(q_j,1) = L_G(A_j,1)$ and $L_M(q_i,0) = L_G(A_i,0) = \emptyset$. It then follows from Lemma 1 that $L_M(q_i,1) = L_G(A_i,1)$. Suppose $l \ge 2$. The hypothesis for the induction on $l$ is that $L_M(q_i,l-1) = L_G(A_i,l-1)$ for any state $q_i$ at depth $d$. From $L_M(q_j,l) = L_G(A_j,l)$ and Lemma 1, it turns out that $L_M(q_i,l) = L_G(A_i,l)$.

We therefore conclude that $L_M(q_i,l) = L_G(A_i,l)$ for any $q_i$ and any $l$. □

C: Proof of Proposition 2 

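This proof is based on that of the Inside-Outside algorithm in [9]. Let $\pi_{i,j}$, $a_{i,k}$, and $b_{i,h}$ be the current parameters. Due to space limitations, we consider only $\pi'_{i,j}$ as a new parameter. For notational convenience, $\theta$ and $\theta'$ denote the sets of the current parameters and the new parameters, respectively. Let $\tau$ be an arbitrary parse tree for a sentence $o^u$, and $C(A \to \alpha; \tau, o^u)$ be the number of times that a rule $A \to \alpha$ ($\alpha \in (N \cup \Sigma)^*$) occurs in $\tau$. Then

$$
\begin{aligned}
\log P(\tau,o^u|\theta)
&= \sum_{(A \to \alpha) \in R} C(A \to \alpha; \tau, o^u) \log P(A \to \alpha|\theta) \\
&= \sum_{i=1}^{n} \Biggl[ \sum_{j \in sub(i)} \Bigl( \sum_{k \in fwd(i)} C(A_i \to A_jA_k; \tau, o^u) + C(A_i \to A_j; \tau, o^u) \Bigr) \log \pi_{i,j} \\
&\qquad + \sum_{k \in fwd(i)} \Bigl( \sum_{j \in sub(i)} C(A_i \to A_jA_k; \tau, o^u) \Bigr) \log a_{i,k} \\
&\qquad + \sum_{j \in sub(i)} C(A_i \to A_j; \tau, o^u) \log a_{i,end} + \sum_{h \in sym(i)} C(A_i \to \sigma_h; \tau, o^u) \log b_{i,h} \Biggr].
\end{aligned}
$$

On the other hand, an auxiliary function $Q(\theta,\theta')$ is defined as

$$
Q(\theta,\theta') = \sum_{u=1}^{v} \sum_{\tau} P(\tau|o^u,\theta) \log P(\tau,o^u|\theta').
$$

Since $P(O|\theta') \ge P(O|\theta)$ if $Q(\theta,\theta') \ge Q(\theta,\theta)$ [5], we will maximize $Q(\theta,\theta')$ by the method of Lagrange multipliers to increase the likelihood. The Lagrange function is

$$
\mathcal{L}(\pi'_{i,j}) = Q(\theta,\theta') + Z_i \Bigl( 1 - \sum_{j \in sub(i)} \pi'_{i,j} \Bigr),
$$

where $Z_i$ is a constant for $i$. If $\partial \mathcal{L}(\pi'_{i,j})/\partial \pi'_{i,j}$ is set to 0 to maximize $Q(\theta,\theta')$,

$$
\pi'_{i,j} = \frac{1}{Z_i} \sum_{u=1}^{v} \sum_{\tau} P(\tau|o^u,\theta) \Bigl( C(A_i \to A_j; \tau, o^u) + \sum_{k \in fwd(i)} C(A_i \to A_jA_k; \tau, o^u) \Bigr)
$$

holds. Incidentally,

$$
\frac{\partial P(o^u|\theta)}{\partial \pi_{i,j}} = \sum_{\tau} \frac{P(\tau,o^u|\theta)}{\pi_{i,j}} \Bigl( C(A_i \to A_j; \tau, o^u) + \sum_{k \in fwd(i)} C(A_i \to A_jA_k; \tau, o^u) \Bigr).
$$

Hence, to set $\partial \mathcal{L}(\pi'_{i,j})/\partial \pi'_{i,j} = 0$, it must hold that

$$
\pi'_{i,j} = \frac{1}{Z_i} \sum_{u=1}^{v} \frac{\pi_{i,j}}{P(o^u|\theta)} \frac{\partial P(o^u|\theta)}{\partial \pi_{i,j}}. \qquad \Box
$$

Abstract. The class of pattern languages was introduced by Angluin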
(1980), and many studies have been undertaken on it from the theoretical viewpoint of learnability. However, there have been few practical studies except for the one by Shinohara (1982), in which patterns are restricted so that every variable occurs at most once. In this paper, we distinguish repetitive variables from those occurring only once within a pattern, and focus on the number of occurrences of a repetitive variable and the length of strings it matches, in order to model the rhetorical device based on repetition of words in classical Japanese poems. Preliminary results suggest that this will lead to a characterization of individual anthologies, which has never been achieved until now.



1 Introduction 

Recently, we have tackled several problems in analyzing classical Japanese poems, Waka. In [12], we successfully discovered from Waka poems characteristic patterns, named Fushi, which are read-once patterns whose constant parts are restricted to sequences of auxiliary verbs and postpositional particles. In [10], we addressed the problem of semi-automatically finding similar poems, and discovered unheeded instances of Honkadori (poetic allusion), one important rhetorical device in Waka poems based on specific allusion to earlier famous poems. Conversely, in [11] we succeeded in discovering expressions highlighting differences between two anthologies by two closely related poets (e.g., master poet and disciples). In the present paper, we focus on repetition.

Repetition is the basis for many poetic forms. The use of repetition can heighten the emotional impact of a piece. This device, however, has received little attention in the case of Waka poetry. One of the main reasons might be that a Waka poem is a short poem, namely, it consists of only five lines and thirty-one syllables, arranged 5-7-5-7-7, and therefore the use of repetition is often considered to waste words (letters) under this tight limitation. In fact, some poets and scholars in earlier times taught their disciples never to repeat a word in a Waka poem. They considered word repetition a 'disease' to be avoided. This device, however, gives a remarkable effect if skillfully used, even in Waka poetry. The following poem, composed by the priest Egyō (who lived in the latter half of the 10th century), is a good example of repetition, where the two words 'nawo' and 'kiku' are each used twice:

Ha-shi-no-na-wo / na-wo-u-ta-ta-ne-to / ki-ku-hi-to-no /

ki-ku-ha-ma-ko-to-ka / u-tsu-tsu-na-ga-ra-ni (Egyō-Shu #195)

Since there have been few studies on this poetic device in the long research history of Waka poetry, it is necessary to develop a method of automatically extracting (candidates for) instances of repetition from a database. To retrieve instances of repetition like the one above, we consider the pattern matching problem for patterns such as *x*x*y*y*, where * is the variable-length don't care (VLDC), a wildcard that matches any string, and x, y are variables that match any non-empty strings.
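Matching a string against the single pattern *x*x*y*y* can be sketched with regular expressions with back referencing. The block below is an illustration only, not the matching method of the paper; the names `DOUBLE_REPETITION` and `has_double_repetition` are assumptions, and the poem is given as a romanized string with hyphens and slashes removed.

```python
import re

# *x*x*y*y*: each ".*" plays the role of a VLDC, each group is a variable
# matching a non-empty string, and \1 and \2 force the repeated occurrences.
DOUBLE_REPETITION = re.compile(r".*(.+).*\1.*(.+).*\2.*", re.DOTALL)


def has_double_repetition(text):
    """Return True if text matches the pattern *x*x*y*y*."""
    return DOUBLE_REPETITION.fullmatch(text) is not None


POEM = "hashinonawonawoutatanetokikuhitonokikuhamakotokautsutsunagarani"
```

With variables allowed to match a single letter, almost every poem matches, which is one motivation for bounding the length of the strings a variable may match. Backtracking matchers can also take exponential time on such patterns in the worst case, so this sketch is suitable for short strings only.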

Recall the pattern languages proposed by Angluin [2]. A pattern is a string in $\Pi = (\Sigma \cup V)^+$, where $V$ is an infinite set $\{x_1, x_2, \ldots\}$ of variables and $\Sigma \cap V = \emptyset$. For example, $ax_1bx_2x_1$ is a pattern, where $a, b \in \Sigma$. The language of a pattern $\pi$ is the set of strings obtained by replacing variables in $\pi$ by non-empty strings. For example, $L(ax_1bx_2x_1) = \{aubvu \mid u, v \in \Sigma^+\}$.

Although the membership problem is NP-complete for the class of Angluin patterns as shown in [2], it becomes polynomial-time solvable when the number of variables occurring within $\pi$ is bounded by a fixed number $k$. Several subclasses have been investigated from the viewpoint of polynomial-time learnability. For example, the classes of read-once patterns (every variable occurs only once) and one-variable patterns (only one variable is contained) are known to be polynomial-time learnable [2]. In the present paper, we study subclasses from the viewpoints of pattern matching and similarity computation.

It should be mentioned that the class of regular expressions with back referencing [1] is considered a superclass of the Angluin patterns. The membership problem for this class is also known to be NP-complete.

On the other hand, we attempted in [10] to semi-automatically discover similar poems from an accumulation of about 450,000 Waka poems in machine-readable form. As mentioned above, one of the aims was to discover unheeded instances of Honkadori. The method is simple: arrange all possible pairs of poems in decreasing order of their similarities, and then have scholars scrutinize the top part of the list. The key to success in this approach is how to develop an appropriate similarity measure. Traditionally, the scheme of weighted edit distance with a weight matrix may have been used to quantify affinities between strings. This scheme, however, requires fine tuning of a weight matrix whose number of entries is quadratic in the alphabet size, either by hand-coding or by a heuristic criterion. As an alternative idea, we introduced a new framework called string resemblance systems (SRSs for short) [10]. In this framework, the similarity of two strings is evaluated via a pattern that matches both of them, supported by an appropriate function that associates a quantity of resemblance with each candidate pattern. This scheme bridges a gap between optimal pattern discovery (see, e.g., [5]) and similarity computation.

† We inserted the hyphens between the syllables of the poem above, each of which was written as one Kana character although romanized here. One can see that every syllable consists of either a single vowel or a consonant and a vowel. Thus there can be no consonantal clusters, and every syllable ends in one of the five vowels a, i, u, e, o.

An SRS is specified by (1) a pattern set to which common patterns belong, 
and (2) a pattern score function that maps each pattern in the set to the quantity 
of resemblance. For example, if we choose the set of patterns with VLDCs and 
define the score of a pattern to be the number of symbols in it, then the obtained 
measure is the length of the longest common subsequence (LCS) of two strings. 
In fact, the strings acdeba and abdac have a common pattern a*d*a* which 
contains three symbols. 
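The measure obtained from this choice of pattern set and score function, the length of the longest common subsequence, can be computed by standard dynamic programming. The sketch below is a textbook LCS routine added for illustration (the name `lcs_length` is an assumption); it is not the general SRS machinery of [10].

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    # table[i][j] is the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]
```

For the two strings of the example, `lcs_length("acdeba", "abdac")` equals the number of symbols in the common pattern a*d*a*.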

With this framework one can easily design and modify his/her measures. 
In fact we designed some measures as combinations of pattern set and pattern 
score function along with the framework, and reported successful results in dis- 
covering unnoticed instances of Honkadori [10]. The discovered affinities raised 
an interesting issue for Waka studies, and we could give a convincing conclusion 
to it: 

1. We have proved that one of the most important poems by Fujiwara-no-
Kanesuke, one of the renowned thirty-six poets, was in fact based on a model
poem found in Kokin-Shu. The same poem had been interpreted merely as
showing “frank utterance of parents’ care for their child.” By extracting the
structure shared by the two poems, our study revealed the poet’s techniques
in composition, which are half hidden by the heart-warming character
of the poem.¹

2. We have compared Tametada-Shu, a mysterious anthology unidentified in
Japanese literary history, with a number of private anthologies edited
after the middle of the Kamakura period (the 13th century) using the same
method, and found that there are about 10 pairs of similar poems between
Tametada-Shu and Sokon-Shu, an anthology by Shotetsu. The result
suggests that the mysterious anthology was edited by a poet in the early
Muromachi period (the 15th century). The editing date had been in dispute
ever since one scholar suggested the middle of the Kamakura period
as a probable one. Our result provides strong evidence on this problem.

¹ Asahi, one of Japan’s leading newspapers, made a front-page report of this discovery
(26 May, 2001).

In this paper, we focus on the class of Angluin patterns and on its subclasses,
and discuss the problems of pattern matching, similarity computation,
and pattern discovery. It should be emphasized that although many studies
have been devoted to the class of Angluin patterns and its subclasses, most
of them have been done from the theoretical viewpoint of learnability. The only
exception is due to Shinohara [9]. He mentioned practical applications, but they
are limited to the subclass called the read-once patterns (referred to as regular
patterns in [9]). We show in this paper the first practical application of Angluin
patterns that are not limited to the read-once patterns. As our framework
quantifies similarities between strings by weighting patterns common to the strings,
we modify the definition of patterns as follows:

- Substitute a gap symbol * for every variable occurring only once in a pattern.

- Associate each variable x with an integer $\mu(x)$ so that the variable x matches
a string w only if the length of w is at least $\mu(x)$. (In the original setting in
[2], $\mu(x) = 1$ for all variables x.)

Since we are interested only in repetitive strings in a Waka poem, there is no
need to name non-repetitive strings. It suffices to use gap symbols * instead of
variables for representing non-repetitive strings. Thus, the first item is rather
for the sake of simplification. In contrast, the second item is an essential
augmentation by which the score of a pattern $\pi$ can be sensitive to the values
of $\mu(x)$ for variables x in $\pi$. In fact, we are strongly interested in the length of
the repeated string when analyzing repetitive expressions in Waka poems.

Fig. 1 is an instance of Honkadori we discovered in [10]. The two poems have
several common expressions, such as “na-ka-ra-he-te” and “to-shi-so-he-ni-ke-ru.”
One can notice that both poems use the repetition of words. Namely,
the Kokin-Shu poem and the Shin-Kokin-Shu poem repeat “nakara” (stem of the verb
“nagarafu”; name of a bridge) and “matsu” (wait; pine tree), respectively. This
strengthens the affinities based on the existence of common substrings.



Poem alluded to. (Kokin-Shu #826) Sakanoue-no-Korenori. 



A-FU-KO-TO-WO 

NA-KA-RA-NO-HA-SHI-NO 

NA-KA-RA-HE-TE 

KO-HI-WA-TA-RU-MA-NI 

TO-SHI-SO-HE-NI-KE-RU 



Without seeing you, 

I have lived on 

Adoring you ever 

Like the ancient bridge of Nagara 

And many years have passed on. 



Allusive-variation. (Shin-Kokin-Shu #1636) Nijoin Sanuki. 



NA-KA-RA-HE-TE 

NA-HO-KI-MI-KA-YO-WO 

MA-TSU-YA-MA-NO 

MA-TSU-TO-SE-SHI-MA-NI 

TO-SHI-SO-HE-NI-KE-RU 



Like the ancient pine tree of longevity 

On the mount of expectation called “Matsuyama, ” 

I have lived on 

Expecting your everlasting reign 
And many years have passed on. 



Fig. 1. Discovered instance of poetic allusion. 



It may be relevant to mention that this work is a multidisciplinary study
spanning literature and computer science. In fact, the second-to-last author
is a Waka researcher and the last author is a linguist of the Japanese
language.

2 A Uniform Framework for String Similarity 

This section briefly sketches the framework of string resemblance systems
according to [10]. Gusfield [6] pointed out that in dealing with string similarity
the language of alignments is often more convenient than the language of edit
operations. Our framework is a generalization of the alignment-based scheme
and is based on the notion of common patterns.

Before describing our scheme, we need to introduce some notation. The set
of strings over an alphabet $\Sigma$ is denoted by $\Sigma^*$. The length of a string u is
denoted by $|u|$. The string of length 0 is called the empty string, and denoted by
$\varepsilon$. Let $\Sigma^+ = \Sigma^* - \{\varepsilon\}$. Let us denote by $\mathbf{R}$ the set of real numbers. A pattern
system is a triple of a finite alphabet $\Sigma$, a set $\Pi$ of descriptions called patterns,
and a function L that maps a pattern in $\Pi$ to a subset of $\Sigma^*$. $L(\pi)$ is called
the language of a pattern $\pi \in \Pi$. A pattern $\pi \in \Pi$ matches a string $w \in \Sigma^*$ if w
belongs to $L(\pi)$. A pattern $\pi$ in $\Pi$ is a common pattern of strings $w_1$ and $w_2$ in
$\Sigma^*$ if $\pi$ matches both of them.

Definition 1. A string resemblance system (SRS) is a 4-tuple $\langle \Sigma, \Pi, L, \mathrm{score} \rangle$,
where $\langle \Sigma, \Pi, L \rangle$ is a pattern system and score is a pattern score function that
maps a pattern in $\Pi$ to a real number.

The similarity $\mathrm{SIM}(x, y)$ between strings x and y with respect to $\langle \Sigma, \Pi, L, \mathrm{score} \rangle$
is defined by $\mathrm{SIM}(x, y) = \max\{\mathrm{score}(\pi) \mid \pi \in \Pi \text{ and } x, y \in L(\pi)\}$. When the
set $\{\mathrm{score}(\pi) \mid \pi \in \Pi \text{ and } x, y \in L(\pi)\}$ is empty or the maximum does not
exist, $\mathrm{SIM}(x, y)$ is undefined.
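
As a minimal sketch of this definition, the function below evaluates SIM over a finite list of candidate patterns. The names `sim`, `contains` and the toy pattern system (a pattern is a plain string u whose language is the set of strings containing u, scored by its length) are our own illustration and do not appear in [10].

```python
from typing import Callable, Hashable, Iterable, Optional


def sim(x: str, y: str,
        patterns: Iterable[Hashable],
        matches: Callable[[Hashable, str], bool],
        score: Callable[[Hashable], float]) -> Optional[float]:
    """Best score among candidate patterns common to x and y.

    Returns None when no candidate is a common pattern, mirroring the
    case in which SIM is undefined.
    """
    common = [score(p) for p in patterns if matches(p, x) and matches(p, y)]
    return max(common) if common else None


def contains(u: str, w: str) -> bool:
    """Toy membership test: pattern u matches w when u is a substring of w."""
    return u in w


candidates = ["a", "ab", "abc", "bcd"]
print(sim("xabcx", "yaby", candidates, contains, len))  # 2, via the pattern "ab"
```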

The above definition regards similarity computation as optimal pattern
discovery. Our framework thus bridges a gap between similarity computation and
pattern discovery. In [10], we defined the homomorphic SRSs and showed that
the class of homomorphic SRSs covers most of the known similarity
(dissimilarity) measures, such as the edit distance, the weighted edit distance, the
Hamming distance, and the LCS measure. We also extended this class in [10] to the
semi-homomorphic SRSs, and the similarity measures we developed in [8] for
musical sequence comparison fall into this class.

We can handle a variety of string (dis)similarity measures by changing the pattern
system and the pattern score function. The pattern systems appearing in the above
examples are, however, restricted to homomorphic ones. Here, we shall mention
SRSs with non-homomorphic pattern systems. An order-free pattern (or fragmentary
pattern) is a multiset $\{u_1, \ldots, u_k\}$ such that $k > 0$ and $u_1, \ldots, u_k \in \Sigma^+$,
and is denoted by $\pi[u_1, \ldots, u_k]$. The language of the pattern $\pi[u_1, \ldots, u_k]$ is the set
of strings that contain the strings $u_1, \ldots, u_k$ without overlaps. The membership
problem of the order-free patterns is NP-complete [7], and the similarity
computation is NP-hard in general, as shown in [7]. However, the membership problem
is polynomial-time solvable when k is fixed. The class of order-free patterns plays
an important role in finding similar poems in anthologies of Waka poems [10].
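
To illustrate why membership is easy for a fixed k, here is a hedged backtracking sketch that tries every placement of each fragment. The function name `order_free_match` is hypothetical and the procedure is a straightforward search, not the algorithm of [7]; for k fragments and a string of length n it examines at most on the order of n^k placements.

```python
from typing import List, Sequence, Tuple


def order_free_match(fragments: Sequence[str], w: str) -> bool:
    """True if all (non-empty) fragments occur in w at pairwise
    non-overlapping positions, in any order."""
    def place(i: int, used: List[Tuple[int, int]]) -> bool:
        if i == len(fragments):
            return True
        u = fragments[i]
        start = w.find(u)
        while start != -1:
            end = start + len(u)
            # the half-open interval [start, end) must avoid earlier placements
            if all(end <= s or e <= start for s, e in used):
                if place(i + 1, used + [(start, end)]):
                    return True
            start = w.find(u, start + 1)
        return False

    return place(0, [])


print(order_free_match(["ab", "ba"], "abba"))  # True: "ab" at 0, "ba" at 2
print(order_free_match(["ab", "ba"], "aba"))   # False: the occurrences overlap
```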

The pattern languages introduced by Angluin [2] are also interesting for our
framework.

Definition 2 (Angluin pattern system). The Angluin pattern system is a
pattern system $\langle \Sigma, (\Sigma \cup V)^+, L \rangle$, where V is an infinite set $\{x_1, x_2, \ldots\}$ of
variables with $\Sigma \cap V = \emptyset$, and $L(\pi)$ is the set of strings $\pi \cdot \theta$ such that $\theta$ is a
homomorphism from $(\Sigma \cup V)^+$ to $\Sigma^+$ with $c \cdot \theta = c$ for every $c \in \Sigma$.

In this paper we discuss SRSs with the Angluin pattern system.
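
For concreteness, membership in an Angluin pattern language can be tested by a simple backtracking search over substitutions. The sketch below is only an illustration under an encoding we assume here (upper-case letters are variables, every other character is a constant); it takes exponential time in the worst case.

```python
from typing import Dict


def angluin_match(pattern: str, w: str) -> bool:
    """Backtracking test of whether w belongs to L(pattern).

    Variables (upper-case letters) are replaced by non-empty strings,
    and every occurrence of a variable gets the same string.
    """
    def go(i: int, j: int, theta: Dict[str, str]) -> bool:
        if i == len(pattern):
            return j == len(w)
        c = pattern[i]
        if not c.isupper():                       # constant symbol
            return j < len(w) and w[j] == c and go(i + 1, j + 1, theta)
        if c in theta:                            # variable already bound
            v = theta[c]
            return w.startswith(v, j) and go(i + 1, j + len(v), theta)
        for end in range(j + 1, len(w) + 1):      # try every non-empty binding
            theta[c] = w[j:end]
            if go(i + 1, end, theta):
                return True
            del theta[c]
        return False

    return go(0, 0, {})


print(angluin_match("XaX", "bab"))  # True with X = "b"
print(angluin_match("XX", "aba"))   # False: "aba" is not a square
```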

3 Computational Complexity 

Definition 3. Membership Problem for pattern system $\langle \Sigma, \Pi, L \rangle$.
Given a pattern $\pi \in \Pi$ and a string $w \in \Sigma^*$, determine whether or not $w \in L(\pi)$.

Theorem 1 ([2]). Membership Problem for the Angluin pattern system is
NP-complete.

Definition 4. Similarity Computation with respect to SRS $\langle \Sigma, \Pi, L, \mathrm{score} \rangle$.
Given two strings $w_1, w_2 \in \Sigma^*$, find a pattern $\pi \in \Pi$ with $\{w_1, w_2\} \subseteq L(\pi)$
that maximizes $\mathrm{score}(\pi)$.

Theorem 2. For an SRS with the Angluin pattern system, Similarity Computation
is NP-hard in general.

Proof. We consider the following problem, which is a decision version of a
special case of Similarity Computation with $w_1 = w_2$, and show its
NP-completeness. Optimal Pattern with respect to SRS $\langle \Sigma, \Pi, L, \mathrm{score} \rangle$:
Given a string $w \in \Sigma^*$ and an integer k, determine whether or not there is a
pattern $\pi \in \Pi$ such that $w \in L(\pi)$ and $\mathrm{score}(\pi) \geq k$.

We give a reduction from Membership Problem for Angluin pattern
system $\langle \Sigma, \Pi, L \rangle$ to Optimal Pattern with respect to an SRS with Angluin
pattern system $\langle \Sigma', \Pi', L', \mathrm{score} \rangle$ for a specific score function score defined as
follows. Let $\Sigma' = \Sigma \cup \{\#\}$ with $\# \notin \Sigma$. We take a one-to-one mapping $\langle \cdot \rangle$ from
$\Pi' = (\Sigma' \cup V)^+$ to $\Sigma^*$ that is log-space computable with respect to $|\pi|$. We
define the score function $\mathrm{score} : \Pi' \to \mathbf{R}$ by $\mathrm{score}(\pi') = 1$ if $\pi'$ is of the form
$\pi' = \pi \# \langle \pi \rangle$ for some $\pi \in \Pi = (\Sigma \cup V)^+$, and $\mathrm{score}(\pi') = 0$ otherwise.

For a given instance $\pi \in \Pi$ and $w \in \Sigma^*$ of Membership Problem for
Angluin pattern system, let us consider $w' = w \# \langle \pi \rangle$ and $k = 1$ as an input
to Optimal Pattern. Then we can see that there is a pattern $\pi' \in \Pi'$ with
$w' \in L'(\pi')$ and $\mathrm{score}(\pi') = 1$ if and only if $w \in L(\pi)$, since $w' \in L'(\pi')$ if and
only if $\pi' = \pi \# \langle \pi \rangle$ and $w \in L(\pi)$. This completes the proof. □

4 Practical Aspects 

Recall that similarities between strings are quantified by weighting patterns
common to them in our framework. For a finer weighting, we augment the
descriptive power of Angluin patterns by putting a restriction on the length of a string
matched by each variable. Namely, we associate each variable x with an integer
$\mu(x)$ such that the variable x matches a string w only if $\mu(x) \leq |w|$. For example,
suppose that $\pi_1 = z_1 x z_2 x z_3$ and $\pi_2 = z_1 y z_2 y z_3$, where $\mu(x) = 2$, $\mu(y) = 3$, and
$\mu(z_1) = \mu(z_2) = \mu(z_3) = 0$. Then, $\pi_1$ is common to the strings bcaaabbaac and
acabbaabbbb, but $\pi_2$ is not. This enables us to define a score function so that it
is sensitive to the lengths of strings substituted for variables.
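
The example can be verified mechanically. Since the gap variables may match the empty string, a pattern of this shape matches w exactly when some string of length at least the lower bound occurs twice in w without overlap. The sketch below uses a hypothetical helper `has_repeat`; it is a brute-force check, not the method used in the experiments.

```python
def has_repeat(w: str, min_len: int, times: int = 2) -> bool:
    """True if some substring of length >= min_len occurs at least `times`
    times in w without overlaps."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # A repeat of a longer string implies a repeat of its prefix of length
    # min_len, so substrings of length exactly min_len suffice.
    for i in range(len(w) - min_len + 1):
        x = w[i:i + min_len]
        count, pos = 0, w.find(x)
        while pos != -1:
            count += 1
            pos = w.find(x, pos + min_len)   # continue after this occurrence
        if count >= times:
            return True
    return False


for w in ("bcaaabbaac", "acabbaabbbb"):
    print(w, has_repeat(w, 2), has_repeat(w, 3))
# bcaaabbaac True False
# acabbaabbbb True True
```

The first pattern (lower bound 2) therefore matches both strings, whereas the second (lower bound 3) fails on bcaaabbaac and so is not a common pattern.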
On the other hand, as we have seen in the last section, similarity computation
as well as the membership problem is intractable in general for the Angluin pattern
system. From a practical point of view, it is valuable to consider subclasses of
the pattern system that are tractable.

Let $\mathrm{occ}_x(\pi)$ denote the number of occurrences of a variable x within a pattern
$\pi \in (\Sigma \cup V)^+$. For example, $\mathrm{occ}_x(abxcyxbz) = 2$. A variable x is said to be
repetitive w.r.t. $\pi$ if $\mathrm{occ}_x(\pi) > 1$. A pattern $\pi$ is said to be read-once if $\pi$ contains
no repetitive variables. Historically, read-once patterns are called regular patterns
because the induced languages are regular [9]. The membership problem of the
read-once patterns is solvable in linear time. A k-repetitive-variable pattern is a
pattern that has at most k repetitive variables. It is not difficult to see that:

Theorem 3. The membership problem of the k-repetitive-variable patterns can
be solved in polynomial time for input of size n.

That is, non-repetitive variables do not matter. Moreover, we are interested
only in repeated strings in text strings. For these reasons, we substitute * for
each of the non-repetitive variables in a pattern. Patterns are then strings over
$(\Sigma \cup V \cup \{*\})$, in which every variable is repetitive. For example, the above pattern
abxcyxbz is written as abxc*xb*.

Despite the polynomial-time computability, the membership problem of the
k-repetitive-variable patterns requires much time to solve. The similarity
computation is therefore very slow in practice. For this reason, in this paper we restrict
ourselves to the case of k = 1, namely, the one-repetitive-variable patterns. In
order to efficiently solve the membership problem and similarity computation for
this class, we utilize a kind of filtering technique. For example, when the pattern
a*xxb*cx matches a string w, the candidate strings for substituting for x
must occur at least three times in w without overlaps. We obtain such substring
statistics on a given string w by exploiting such data structures as the minimal
augmented suffix trees developed by Apostolico and Preparata [3,4].

A suffix tree [6] for a string w is a tree structure that represents all suffixes of
w as paths from the root to leaves, so that every node except the leaves has at least
two children. Suffix trees are useful for various string processing tasks [6].
Each node v corresponds to a substring of w. With each internal node v, we
associate the number of leaves of the subtree rooted at v. This number equals the
number of (possibly overlapping) occurrences in w of the substring corresponding
to the node (see Fig. 2 (a)).

A minimal augmented suffix tree is an augmented version of the suffix tree,
where additional nodes are introduced to count non-overlapping occurrences
(see Fig. 2 (b)).
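
The filtering idea can be sketched without the tree itself. The code below is a naive stand-in, assumed here purely for illustration, that tabulates the maximum number of non-overlapping occurrences of every substring; a minimal augmented suffix tree provides the same statistics far more efficiently. The names `nonoverlap_counts` and `candidates` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def nonoverlap_counts(w: str, min_len: int = 1) -> Dict[str, int]:
    """Map each substring of w with length >= min_len to its maximum number
    of pairwise non-overlapping occurrences (naive cubic-time version)."""
    counts: Dict[str, int] = defaultdict(int)
    next_free: Dict[str, int] = defaultdict(int)  # first position still usable
    for i in range(len(w)):
        for j in range(i + max(min_len, 1), len(w) + 1):
            x = w[i:j]
            if i >= next_free[x]:        # taking the leftmost fit is optimal
                counts[x] += 1
                next_free[x] = j
    return dict(counts)


def candidates(w: str, occ: int, mu: int) -> Set[str]:
    """Strings that could be substituted for x in a pattern that contains
    occ copies of x with lower bound mu on the length of x."""
    return {x for x, c in nonoverlap_counts(w, mu).items() if c >= occ}


print(nonoverlap_counts("aaaa")["aa"])          # 2 (three overlapping occurrences)
print(sorted(candidates("abcabcab", 2, 3)))     # ['abc', 'bca', 'cab']
```

A pattern can be discarded immediately when `candidates` is empty for the given string.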



5 Application to Waka Data 



In this section, we present and discuss the results of our experiments carried
out on the Eight Imperial Anthologies, the first eight of the anthologies
compiled by imperial command, listed in Table 1.

Fig. 2. (a) Suffix tree and (b) minimal augmented suffix tree for the string ababaababa%.
The number associated with each internal node denotes the number of occurrences of the
corresponding string in the string, where an occurrence means a possibly overlapping
occurrence in (a) and a non-overlapping occurrence in (b). For example, the string aba
occurs four times in the string ababaababa, but it appears only three times without overlapping.

Table 1. Eight Imperial Anthologies.

no.    anthology         compilation   # poems
I      Kokin-Shu         905           1,111
II     Gosen-Shu         955-958       1,425
III    Shui-Shu          1005-1006     1,360
IV     Go-Shui-Shu       1087          1,229
V      Kinyo-Shu         1127          717
VI     Shika-Shu         1151          420
VII    Senzai-Shu        1188          1,290
VIII   Shin-Kokin-Shu    1216          2,005


5.1 Similarity Computation 

For successful discovery, we want to put appropriate restrictions on the
pattern system and on the pattern score function by using some domain
knowledge. However, there are few studies on the repetition of words in Waka poems, as
stated before, and therefore we do not know in advance what kind of restriction
is effective.

We take a stepwise-refinement approach; namely, we start with a very simple
pattern system and score function, and then improve them based on an analysis of
the obtained results. Here we restrict ourselves to one-repetitive-variable patterns.
Moreover, we use a simple pattern score function that is not sensitive to
characters or VLDCs in the patterns. Namely, the score of a*xxb*cx is identical to that
of *x*x*x*, for example. Despite this simplification, we wish to pay attention to
how long the strings that match variable x are. Thus, a one-repetitive-variable
pattern $\pi$ is essentially expressed as two integers: $\mathrm{occ}_x(\pi)$ and $\mu(x)$. We assume
that the score function is non-decreasing with respect to $\mathrm{occ}_x(\pi)$ and to $\mu(x)$.

We compared the anthology Kokin-Shu with two anthologies, Gosen-Shu
and Shin-Kokin-Shu. The score function we used is defined by
$\mathrm{score}(\pi) = \mathrm{occ}_x(\pi) \cdot \mu(x)$. The frequency distributions are shown in Table 2.

Table 2. Frequency distribution of similarity values in the comparison of Kokin-Shu with
Gosen-Shu and Shin-Kokin-Shu. Note that similarity values cannot be 1, 2, 3, 5, 7
because of the definition of the pattern score function. The frequencies for any similarity
values not present here are all 0.

similarity value   0           4         6       8    10
Gosen-Shu          1,390,030   178,331   1,944   37   8
Shin-Kokin-Shu     1,962,550   244,776   2,173   11   0

From the table, there seem to be relatively higher similarities between Kokin-Shu
and Gosen-Shu than between Kokin-Shu and Shin-Kokin-Shu. We examined the first
part of a list of poem pairs arranged in decreasing order of similarity value. However,
we had the impression that most of the pairs with a high similarity value are
dissimilar, probably because the pattern system we used is too simple to quantify the
affinities concerning repetition techniques. See the poems shown in Fig. 3. All
the poems are matched by the pattern *x*x* with $\mu(x) = 4$. The first three
poems are similar to each other, while the other pairs are dissimilar. It seems that
information about the locations at which a string occurs repeatedly is important.
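
A rough sketch of this first, position-insensitive measure follows. It relies on the observation that a pattern with occ copies of x and lower bound m matches a string exactly when some substring of length m has at least occ non-overlapping occurrences. The function names, the default minimum length of 2, and the value 0 for pairs without a common repetitive pattern are our assumptions; plain Python strings stand in for syllable sequences.

```python
from typing import Dict


def max_nonoverlap(w: str, x: str) -> int:
    """Maximum number of non-overlapping occurrences of x in w (greedy scan)."""
    count, pos = 0, w.find(x)
    while pos != -1:
        count += 1
        pos = w.find(x, pos + len(x))
    return count


def repeat_profile(w: str, min_len: int = 2) -> Dict[int, int]:
    """profile[m] = largest non-overlapping occurrence count of any length-m
    substring of w, kept only when that count is at least 2."""
    profile = {}
    for m in range(max(min_len, 1), len(w) // 2 + 1):
        best = max(max_nonoverlap(w, w[i:i + m]) for i in range(len(w) - m + 1))
        if best >= 2:
            profile[m] = best
    return profile


def similarity(w1: str, w2: str, min_len: int = 2) -> int:
    """Best occ * mu over one-repetitive-variable patterns common to w1 and w2."""
    p1, p2 = repeat_profile(w1, min_len), repeat_profile(w2, min_len)
    return max((m * min(p1[m], p2[m]) for m in p1.keys() & p2.keys()), default=0)


print(similarity("abcdXabcd", "wxyzQwxyz"))  # 8: two copies of a length-4 string
print(similarity("abab", "cdcd"))            # 4
```

Note that the two strings need not repeat the same substring, only substrings of sufficient length, which is one reason dissimilar poems can receive a high value under this measure.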



ka-su-ka-no-ha/ke-fu-ha-na-ya-ki-so/wa-ka-ku-sa-no/
tsu-ma-mo-ko-mo-re-ri/wa-re-mo-ko-mo-re-ri/ (Kokin-Shu #17)

to-shi-no-u-chi-ni/ha-ru-ha-ki-ni-ke-ri/hi-to-to-se-wo/
ko-so-to-ya-i-ha-mu/ko-to-shi-to-ya-i-ha-mu/ (Kokin-Shu #1)

hi-ru-na-re-ya/mi-so-ma-ka-he-tsu-ru/tsu-ki-ka-ke-wo/
ke-fu-to-ya-i-ha-mu/ki-no-fu-to-ya-i-ha-mu/ (Gosen-Shu #1100)

ha-ru-ka-su-mi/ta-te-ru-ya-i-tsu-ko/mi-yo-shi-no-no/
yo-shi-no-no-ya-ma-ni/yu-ki-ha-fu-ri-tsu-tsu/ (Kokin-Shu #3)

tsu-ra-ka-ra-ha/o-na-shi-ko-ko-ro-ni/tsu-ra-ka-ra-m/
tsu-re-na-ki-hi-to-wo/ko-hi-m-to-mo-se-su/ (Gosen-Shu #592)



Fig. 3. Poems that are matched by the same pattern *x*x* with $\mu(x) = 4$. All pairs
have the same similarity value. The first three poems can be considered to ‘share’ the
same poetic device and are closely similar, while some pairs are dissimilar.

Moreover, we observed that there are a lot of meaningless repetitions of
strings, especially when $\mu(x)$ is relatively small, say, $\mu(x) = 2$. It seems better to
restrict ourselves to repetitions of strings occurring at the beginning or the end
of a line in order to remove such repetitions.

We assume the lines of a poem are parenthesized by [ and ]. Then, the pattern
[*][x*][x*][*][*], for example, matches any poem whose second and third lines
begin with the same string. We want to use the set of such patterns as the pattern
set, but the number of such patterns is $3^5 = 243$, which makes the similarity
computation impractical. However, by using the minimal augmented suffix
trees, we can filter out wasteful computation and perform the computation in
reasonable time. The results are shown in Table 3. By examining the first part of
the list, we confirmed that this time pairs with a high similarity value are closely similar.
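
The refined pattern set is small enough to enumerate explicitly. In the hedged sketch below each of the five lines takes one of three forms; the representation of a poem as tuples of syllables and the rule that at least two lines must carry x (so that x is repetitive) are assumptions of this illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple

FORMS = ("*", "x*", "*x")  # free line, line beginning with x, line ending with x


def all_line_patterns(n_lines: int = 5) -> List[Tuple[str, ...]]:
    """All assignments of a form to each line; 3**5 = 243 for five lines."""
    return list(product(FORMS, repeat=n_lines))


def matches(pattern: Sequence[str], lines: Sequence[Tuple[str, ...]], mu: int) -> bool:
    """True if some x with at least mu syllables (mu >= 1) fits every
    constrained line of the poem."""
    constrained = [(f, ln) for f, ln in zip(pattern, lines) if f != "*"]
    if len(constrained) < 2:            # x would not be repetitive
        return False
    form0, line0 = constrained[0]       # candidates for x come from this line
    for length in range(mu, len(line0) + 1):
        x = line0[:length] if form0 == "x*" else line0[-length:]
        if all((ln[:length] if f == "x*" else ln[-length:]) == x
               for f, ln in constrained):
            return True
    return False


poem = [tuple(s.split("-")) for s in (
    "to-shi-no-u-chi-ni", "ha-ru-ha-ki-ni-ke-ri", "hi-to-to-se-wo",
    "ko-so-to-ya-i-ha-mu", "ko-to-shi-to-ya-i-ha-mu")]   # Kokin-Shu #1
print(len(all_line_patterns()))                          # 243
print(matches(("*", "*", "*", "*x", "*x"), poem, 4))     # True: x = to-ya-i-ha-mu
```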



Table 3. Improved results. Frequency distribution of similarity values in the comparison
of Kokin-Shu with Gosen-Shu and Shin-Kokin-Shu. Note that similarity values cannot
be 1, 2, 3, 5, 7 because of the definition of the pattern score function. The frequencies
for any similarity values not present here are all 0.

similarity value   0           4     6    8   10
Gosen-Shu          1,569,925   407   14   1   3
Shin-Kokin-Shu     2,208,888   583   39   0   0


5.2 Characterization of Anthologies 



Table 4 shows the 30 most frequent patterns occurring in Kokin-Shu. The table
illustrates variations of word repetition techniques.



Table 4. The 30 most frequent patterns in Kokin-Shu.

freq.  pattern              freq.  pattern              freq.  pattern
11     [*][*][x*][x*][*]    3      [*x][*][*x][*][*]    1      [x*][*][x*][*][*x]
10     [x*][x*][*][*][*]    3      [*][x*][*][*][x*]    1      [x*][*][*][*][*x]
10     [*][x*][x*][*][*]    3      [*][*x][*][*x][*]    1      [*x][*][x*][*][*]
7      [x*][*][*][*][x*]    3      [*][*][x*][*][x*]    1      [*x][*][*][*][*x]
5      [*][*x][*][*][*x]    3      [*][*][*][*x][*x]    1      [*][x*][*x][*][*]
5      [*][*][*x][x*][*]    2      [x*][*][x*][*][*]    1      [*][x*][*][x*][*]
5      [*][*][*][x*][x*]    2      [*x][*][*][*x][*]    1      [*][*x][x*][*][*]
4      [x*][*][*][x*][*]    2      [*x][*][*][*][x*]    1      [*][*][x*][*x][*]
4      [*x][x*][*][*][*]    2      [*][*x][*x][*][*]    1      [*][*][x*][*][*x]
4      [*][*][*x][*x][*]    1      [x*][x*][*][*][x*]   0      [x*][x*][x*][x*][x*]
For every pattern of the above mentioned form, we collected the poems that
are matched by it from the first eight imperial anthologies shown in Table 1.
The results are summarized in Table 5.

Table 5. Characterization of anthologies. I, II, III, IV, V, VI, VII, VIII represent
Kokin-Shu, Gosen-Shu, Shui-Shu, Go-Shui-Shu, Kinyo-Shu, Shika-Shu, Senzai-Shu,
Shin-Kokin-Shu, respectively.

$(\mathrm{occ}_x(\pi), \mu(x))$   I     II    III   IV    V     VI    VII   VIII
(2,2)                 96    104   118   108   24    22    77    112
(2,3)                 23    20    28    31    5     9     17    19
(2,4)                 10    7     13    5     4     5     3     1
(2,5)                 5     5     10    3     2     2     1     0
(3,2)                 2     11    2     3     0     1     1     0
(3,3)                 0     0     0     2     0     1     0     0
(3,4)                 0     0     0     0     0     0     0     0
(3,5)                 0     0     0     0     0     0     0     0
(4,2)                 0     5     0     0     0     0     0     0
(4,3)                 0     0     0     0     0     0     0     0
(4,4)                 0     0     0     0     0     0     0     0
(4,5)                 0     0     0     0     0     0     0     0
(5,2)                 0     1     0     0     0     0     0     0
(5,3)                 0     0     0     0     0     0     0     0
(5,4)                 0     0     0     0     0     0     0     0
(5,5)                 0     0     0     0     0     0     0     0

The first four anthologies have a considerable number of poems that use repetition
of words, even for large values of $\mu(x)$. This contrasts with Shin-Kokin-Shu, where
such repetition is limited to small values of $\mu(x)$. This might be a reflection of the
editors’ preferences or of a literary trend. In any case, pursuing the reason for such
differences will provide clues for further investigation of literary trends or the
editors’ personalities.

6 Concluding Remarks 

The Angluin pattern languages have been studied mainly from theoretical
viewpoints. There have been no practical applications except for those limited to the
read-once patterns. This paper presented the first practical application of the Angluin
pattern languages that is not limited to read-once patterns. We hope that
pattern matching and similarity computation for the patterns discussed in this paper
will lead to the discovery of overlooked aspects of individual poets.

We distinguished repetitive variables (i.e., those occurring more than once in a
pattern) from non-repetitive variables, and associated each variable x with an
integer $\mu(x)$ as the lower bound on the length of strings the variable x matches.
This enables us to give a pattern score depending upon the lengths of strings
substituted for variables. For one-repetitive-variable patterns, we presented a way
of speeding up pattern matching, which uses substring statistics from the minimal
augmented suffix tree of a given string as a filter that excludes patterns that
cannot match it. A preliminary experiment showed that this idea successfully speeds
up repeated pattern matching against many patterns.

In this paper, we restricted ourselves to one-repetitive-variable patterns and
to repetitions of words that occur at the beginning or the end of the lines of a Waka
poem. The restriction played an important role, but we want to consider slightly
more complex patterns. For example, the following two poems are matched by
the pattern [*][*][x*][xx*][*].

[shi-ra-yu-ki-no] [ya-he-fu-ri-shi-ke-ru] [ka-he-ru-ya-ma]
[ka-he-ru-ka-he-ru-mo] [o-i-ni-ke-ru-ka-na] (Kokin-Shu #902)

[a-fu-ko-to-ha] [ma-ha-ra-ni-a-me-ru] [i-yo-su-ta-re]
[i-yo-i-yo-wa-re-wo] [wa-hi-sa-su-ru-ka-na] (Shika-Shu #244)

Moreover, the next poem is matched by a pattern that contains two repetitive
variables.

[wa-su-re-shi-to] [i-hi-tsu-ru-na-ka-ha] [wa-su-re-ke-ri]
[wa-su-re-mu-to-ko-so] [i-fu-he-ka-ri-ke-re] (Go-Shui-Shu #886)

Dealing with more general patterns like these is left for future work.
Abstract. This paper describes a web site rating and improvement
method that automatically suggests how to improve a web site based
on its hyperlink structure. First, web site visualization using a
three-dimensional hyperbolic tree shows us a map of the web site. This allows
us to understand the overall web site structure and to discover where
information is concentrated or missing. While visualizing the web site, the
web site is rated from six viewpoints by analyzing all descriptions
of the homepages contained in the web site. The rating result is then
expressed as a radar chart. Furthermore, some features of branches, which
contain homepages located in a lower layer, are extracted using a
machine learning technique. If no feature is extracted, we can understand
that information is not well organized in the lower layer of the branch.
In contrast, branches that have the same features are combined by an
additional hyperlink. Feature extraction for each branch in a web site
automatically yields suggestions for improving the hyperlink
structure.



1 Introduction 

In general, a web site structure grows through a disorderly extension of hyperlinks.
Such a structure does not lend itself to information retrieval. In this paper,
we focus on how to construct a web site suited to information retrieval rather than
on information retrieval itself, such as by a search engine. We then propose a method
for web site rating based on hyperlink structure and a method for automatically
generating suggestions to improve hyperlink structures.

Jakob Nielsen discussed web site design from the viewpoint of web usability in
his book [Nielsen, 2000]. According to his book, a web site should be constructed 
based on common sense of the web from a user’s viewpoint. He also discussed 
web site structure. For example, a structure copying the organization is not 
so effective; it would be better to construct a web site that categorizes related 
information. 

To evaluate a web site hyperlink structure, we first use a three-Dimensional
Hyperbolic Tree to visualize the web site’s overall structure. This enables us to
understand the overall web site structure and to discover where information is
concentrated or lacking. In visualizing the web site, the web site is rated from
six viewpoints by analyzing all descriptions of the homepages contained in the web
site. The rating is then expressed as a radar chart. Furthermore, some features
of branches that contain homepages located in a lower layer are extracted
using a machine learning technique. If no feature is extracted, we can understand
that information in the lower layer of the branch is not well organized. Branches
that have the same feature are combined by additional hyperlinks. Feature
extraction for each branch in a web site automatically generates suggestions for
improving the hyperlink structure.

Thus, our web rating reads all pages and presents a web site map. The rating 
results are expressed as the radar chart, and features of each branch are extracted 
to automatically generate suggestions for improving the hyperlink structure. 



2 Web Site Rating 

Our web site rating begins by clarifying the overall structure of a web site. We
adopt a three-Dimensional Hyperbolic Tree to make a web site map (Figure 1,
left). The Hyperbolic Tree represents a web site as a tree¹ that is reflected in the
Hyperbolic plane. This can show us the whole set of homepages contained
in the web site. From the three-Dimensional Hyperbolic Tree, we can understand
the scale of the web site and discover where information is concentrated or lacking.
All pages are downloaded to make the three-Dimensional Hyperbolic Tree
and are analyzed in order to extract six quantitative features expressed as a
radar chart, as in Figure 1 (right).

Scale 

This is an evaluation of the web site scale and an index indicating how many 
pages the web site contains. 

Update 

This is an evaluation of the content freshness and an index indicating how 
frequently the homepages are updated. 

Link 

This is an evaluation of navigating in the web site and an index indicating 
how many internal links the web site has. 

Portal 

This is an evaluation of hyperlinks to related sites and an index indicating 
how many external links to related sites exist in the web site. 

Media 

This is an evaluation of homepage design and an index indicating how many 
media, i.e., pictures or animation, are used in the web site. 

Structure 

This is an evaluation about web site structure and an index that shows 
whether the web site has the ideal structure. 



¹ The top page of the web site becomes the root of the tree and the branches are
expanded in a breadth-first manner. Hyperlinks to previously found pages are
ignored.
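
The tree construction described in this footnote can be sketched as a breadth-first traversal. The input format (a dictionary from each page to the pages it links to, assumed to have been crawled already) and the function name `site_tree` are assumptions of this illustration, not details given in the paper.

```python
from collections import deque
from typing import Dict, List, Tuple


def site_tree(links: Dict[str, List[str]], top: str) -> Tuple[Dict[str, List[str]], Dict[str, int]]:
    """Breadth-first spanning tree of a web site.

    Returns (children, depth): the tree as child lists and the layer of
    each page. Hyperlinks to pages that were already found are ignored.
    """
    children: Dict[str, List[str]] = {top: []}
    depth: Dict[str, int] = {top: 0}
    queue = deque([top])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target in depth:            # previously found page: skip the link
                continue
            depth[target] = depth[page] + 1
            children[page].append(target)
            children[target] = []
            queue.append(target)
    return children, depth


links = {"/": ["/a", "/b"], "/a": ["/b", "/c", "/"], "/b": ["/c"]}
children, depth = site_tree(links, "/")
print(children)  # {'/': ['/a', '/b'], '/a': ['/c'], '/b': [], '/c': []}
print(depth)     # {'/': 0, '/a': 1, '/b': 1, '/c': 2}
```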
Fig. 1. Three-Dimensional Hyperbolic Tree (left) and Radar Chart (right)



“Scale” is calculated by counting the homepages in the web site. “Update” 
is calculated by getting the last modified time of the homepages. “Portal,” 
“Link” and “Media” are obtained from tag information. For example, if the 
URL of the anchor tag is the URL of the same site, it is counted as the “Link”; 
if the URL is of a different site, it is counted as “Portal.” “Structure” is 
evaluated by counting the pages included in each layer. The ideal quantity of 
pages in each layer is defined in advance, and the system evaluates whether the 
quantity of pages on each layer is close to the ideal quantity. 
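
As a minimal sketch of how “Link” and “Portal” might be counted from anchor tags, the function below compares host names using the standard `urllib.parse` module. Treating “same site” as “same host name” is an assumption made here; the paper does not state the exact criterion.

```python
from typing import Iterable, Tuple
from urllib.parse import urljoin, urlparse


def count_link_and_portal(page_url: str, hrefs: Iterable[str]) -> Tuple[int, int]:
    """Return (link, portal): anchors that stay inside the site versus
    anchors that point to other sites."""
    site = urlparse(page_url).netloc.lower()
    link = portal = 0
    for href in hrefs:
        target = urlparse(urljoin(page_url, href))   # resolve relative URLs
        if target.scheme not in ("http", "https"):
            continue                                 # e.g. mailto: links
        if target.netloc.lower() == site:
            link += 1
        else:
            portal += 1
    return link, portal


hrefs = ["b.html", "/c", "http://example.com/d", "http://other.org/", "mailto:x@example.com"]
print(count_link_and_portal("http://example.com/a/index.html", hrefs))  # (3, 1)
```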

The standard structure in our method is one in which the quantity of pages
increases as the layer becomes deeper. The ideal web site structure is thus a
pyramid in which the second layer has more pages than the first, the third layer
has more than the second, etc. This is our heuristic based on the concept that
there is an ideal quantity of branches from one page. In general, a web site has a
directory structure that collects related information, for example, “car → maker
→ parts,” and the number of categories tends to increase as the layer becomes
deeper. Many hyperlinks in one page means that there are many branches for
browsing. In this case, the possibility of visitors straying into another branch
becomes high, so it is not a good structure. Furthermore, a structure in which
the quantity of pages becomes smaller as the layer becomes deeper may be
inefficient. This occurs when many hyperlinks are in a shallow layer. In this case,
in order to get target information in a deeper layer, we may have to visit many
meaningless branches in shallow parts.
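
The pyramid heuristic can be expressed in a few lines. The paper does not give the formula used to compare a site with the ideal quantities, so `structure_score` below is a purely hypothetical closeness measure (the mean ratio of the smaller to the larger count per layer); only the layer counting and the strict-increase check follow directly from the text.

```python
from collections import Counter
from itertools import zip_longest
from typing import Dict, List, Sequence


def pages_per_layer(depth: Dict[str, int]) -> List[int]:
    """Number of pages in each layer, given the layer (depth) of every page."""
    counts = Counter(depth.values())
    return [counts[d] for d in range(max(counts) + 1)]


def is_pyramid(layer_counts: Sequence[int]) -> bool:
    """True if every layer has more pages than the layer above it."""
    return all(a < b for a, b in zip(layer_counts, layer_counts[1:]))


def structure_score(layer_counts: Sequence[int], ideal_counts: Sequence[int]) -> float:
    """Hypothetical closeness to the ideal quantities, between 0 and 1."""
    pairs = list(zip_longest(layer_counts, ideal_counts, fillvalue=0))
    if not pairs:
        return 0.0
    return sum(min(a, b) / max(a, b) if max(a, b) else 1.0 for a, b in pairs) / len(pairs)


print(pages_per_layer({"/": 0, "/a": 1, "/b": 1, "/c": 2}))  # [1, 2, 1]
print(is_pyramid([1, 3, 9]), is_pyramid([1, 2, 1]))          # True False
print(structure_score([1, 2], [1, 4]))                       # 0.75
```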
3 Improvement of Hyperlink Structure

We can evaluate a web site automatically from several viewpoints using our web 
rating method described above. If an expert evaluates a web site, we can ask the 
expert about improving the web site. However, it is difficult for us to understand 
which part should be improved and how we should improve the web site from only 
the evaluation value. In this section, therefore, we describe automatic generation 
of suggestions for improving web site structures. 

Web sites should be constructed from the user’s viewpoint, and it is important
to categorize homepages that have the same topic [Nielsen, 2000]. The suggestions
generated by our method point out the portions where information is scattered
and the portions dealing with the same topic that should be combined. That is,

“Information of this part is not well-organized. 

You should combine these portions.” 



Fig. 2. Range to extract the branch feature (the figure marks the range of branch A)



The most important technique for automatically providing such suggestions 
is to extract the “branch feature,” which is a feature of homepages included in 
the lower layer of one node in the web site structured as a tree. Figure 2 shows 
the concept of the branch feature. The range of one branch is the lower part of 
the node, and the features of the homepages included in the branch are extracted 
as branch features. This clarifies what information is included in pages of the 
lower level. If the branch features cannot be extracted, information of pages in 
the branch is not well-organized. If the same branch feature is extracted from 
several branches, those branches can be combined as one branch. 

Keywords in an HTML document can be used as the features of a homepage.
They are extracted by using morphological analysis tools. Image data and tag
information can also be used as features of a homepage. Rupert Parson used
keywords, URLs and hyperlink relations as features of interesting homepages for
users [Parson, 1998].

Figure 3 shows an example of branch features and homepage features. The 
end branch includes only one homepage, so the branch feature of the end branch 
is the same as the feature of the homepage. For example, the feature of the 
branch E is the feature of homepage E. Since homepage E has the features a, b 
and e, the branch features of E become a, b and e. Branch B also contains the 
homepages B, E and F. The branch features of B thus become a and b, which 
are common features among homepage B (a,b,d), homepage E (a,b,e) and 
homepage F (a,b,f). The branch feature of A is common to all homepages, so 
the branch feature of A becomes a. 
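
The example above can be sketched in Python by taking the branch feature of a node to be the intersection of the feature sets of all homepages in its branch. This is only an illustration: the tree shape below A, the features assigned to homepage A, and the names `children`, `features`, `pages_in_branch` and `branch_features` are assumptions; only the feature sets of homepages B, E and F come from the example.

```python
# Hypothetical site tree: node -> list of child nodes (assumed shape).
children = {"A": ["B"], "B": ["E", "F"], "E": [], "F": []}

# Homepage features; B, E and F follow the example, A is assumed.
features = {
    "A": {"a", "c"},
    "B": {"a", "b", "d"},
    "E": {"a", "b", "e"},
    "F": {"a", "b", "f"},
}

def pages_in_branch(node):
    """Return the node itself plus every homepage below it."""
    pages = [node]
    for child in children[node]:
        pages.extend(pages_in_branch(child))
    return pages

def branch_features(node):
    """Features common to all homepages in the branch rooted at node."""
    return set.intersection(*(features[p] for p in pages_in_branch(node)))

for node in ("E", "B", "A"):
    print(node, sorted(branch_features(node)))
```

An empty result for some node would correspond to the case in which no branch feature can be extracted, and equal results for two sibling nodes to the case in which the branches could be combined.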




Thus, machine learning techniques can be used for extracting common
features. The association rule [Agrawal, 1994], which is a data mining technique,
can express a feature together with a percentage, for example that the feature
is included in 50% of the homepages, even if the feature does not appear in all
pages. It also supports expressions that combine features. Inductive Logic
Programming (ILP) supports negative examples, so it can extract features that
exist only in the branch.
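
The percentage-based relaxation can be illustrated with a simple frequency count. The sketch below is not an association rule miner; it only keeps the features that occur in at least a given fraction of the pages of a branch, in the spirit of the support measure. The function name `frequent_features`, the parameter `min_support` and the sample pages are assumptions made for illustration.

```python
from collections import Counter

def frequent_features(page_features, min_support=0.5):
    """Map each feature found in at least min_support of the pages to its share."""
    counts = Counter(f for feats in page_features for f in set(feats))
    n = len(page_features)
    return {f: c / n for f, c in counts.items() if c / n >= min_support}

# Hypothetical branch with four homepages.
pages = [{"a", "b", "d"}, {"a", "b", "e"}, {"a", "f"}, {"e", "g"}]
print(frequent_features(pages))
```

With `min_support=1.0` the function reduces to the strict intersection used for branch features.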



4 Improving Web Site Structure 

We introduce an example in which our web site rating contributed to the
improvement of a web site. We ran a free web site rating service
(http://imct-sev.imc.sut.ac.jp/webrating/WebRating) on our web site for
10 days in November 2000. At that time, the web site shown in Figure 4 did
not have a lot of pages and we could see informational deviation clearly, so it
was not a good site. However, the web administrator reconstructed the web site
after seeing this result. Three months later, the web site had changed to the
orderly structure shown on the right of Figure 4. Thus, our web rating can
promote the improvement of a web site and provide effective web site information.

Fig. 4. Example of web site improvement

5 Conclusions 

In this paper, we proposed an automatic web site rating and improvement
method that generates suggestions for improving the web site structure. Though
several methods based on the hyperlink structure have been proposed to improve
the efficiency of information retrieval, we have focused on the hyperlinks
within a web site. In so doing, we realized a system for web site rating and web
site structure improvement.

Our web site rating system outputs six evaluation parameters on a radar
chart. It can promote the improvement of a web site and provide effective
information about the web site. In addition, rating web sites of a specific type
enables us to discover their special features.

By extracting the branch features, our system automatically suggests how
to improve the web site structure. It shows us portions that are not
well-organized and suggests combining branches that have similar information.
It is especially effective for web sites constructed by several users.
Abstract. An episode pattern is a generalization of a subsequence
pattern in which the length of the substring containing the subsequence is
bounded. Given two sets of strings, consider the optimization problem
of finding a best episode pattern that is common in one set but not
common in the other set. The problem is known to be NP-hard. We give a
practical algorithm to solve it exactly.



1 Introduction 

These days, a lot of text data and sequential data are available, and it is quite
important to discover useful rules from these data. Finding a good rule to separate
two given sets, often referred to as positive examples and negative examples, is a
critical task in Discovery Science as well as Machine Learning.

In [4], Hirao et al. considered subsequence patterns as rules. A subsequence
pattern $s$ matches a string $t$ if $s$ can be obtained by deleting zero or more
characters from $t$. They introduced a practical algorithm to find a best
subsequence pattern that separates positive examples from negative examples, and
showed some experimental results. A drawback of subsequence patterns is that
they are not suitable for classifying long strings over a small alphabet, since a
short subsequence pattern matches almost all long strings.

In this paper, we consider episode patterns, which were originally introduced
by Mannila et al. [5]. An episode pattern $\langle v, k \rangle$, where $v$ is a
string and $k$ is an integer, matches a string $t$ if $v$ is a subsequence of some
substring $u$ of $t$ with $|u| \le k$. An episode pattern is a generalization of a
subsequence pattern, since the subsequence pattern $v$ is equivalent to the episode
pattern $\langle v, \infty \rangle$. We give a practical solution for finding a best
episode pattern that separates a given set of strings from another set of strings.
We propose a practical implementation of an exact search algorithm that avoids
exhaustive search in practice. The key idea is to introduce heuristics that reduce
the search space based on the combinatorial properties of episode patterns, and to
use an efficient data structure that helps to determine whether an episode pattern
matches a fixed string, at the cost of the preprocessing time and space required
to construct it.
2 Preliminaries

Let $\mathcal{N}$ be the set of integers. Let $\Sigma$ be a finite alphabet, and let
$\Sigma^*$ be the set of all strings over $\Sigma$. For a string $w$, we denote by $|w|$
the length of $w$. For a set $S \subseteq \Sigma^*$ of strings, we denote by $|S|$ the
number of strings in $S$, and by $\|S\|$ the total length of strings in $S$. We say
that a string $v$ is a prefix (substring, suffix, resp.) of $w$ if $w = vy$
($w = xvy$, $w = xv$, resp.) for some strings $x, y \in \Sigma^*$. We say that a string
$v$ is a subsequence of a string $w$ if $v$ can be obtained by removing zero or more
characters from $w$. We denote by $v \preceq_{str} w$ that $v$ is a substring of $w$,
and by $v \preceq_{seq} w$ that $v$ is a subsequence of $w$. An episode pattern is a
pair of a string $v$ and an integer $k$, and we define the episode language
$L^{eps}(\langle v, k \rangle)$ by

$$L^{eps}(\langle v, k \rangle) = \{ w \in \Sigma^* \mid \exists u \preceq_{str} w \text{ such that } v \preceq_{seq} u \text{ and } |u| \le k \}.$$

We formulate the problem by following our previous paper [4]. Readers should
refer to [4] for the basic idea behind this formulation. We say that a function $f$
from $[0, x_{max}] \times [0, y_{max}]$ to real numbers is conic if

— for any $0 \le y \le y_{max}$, there exists an $x_1$ such that

• $f(x, y) \ge f(x', y)$ for any $0 \le x < x' \le x_1$, and

• $f(x, y) \le f(x', y)$ for any $x_1 \le x < x' \le x_{max}$.

— for any $0 \le x \le x_{max}$, there exists a $y_1$ such that

• $f(x, y) \ge f(x, y')$ for any $0 \le y < y' \le y_1$, and

• $f(x, y) \le f(x, y')$ for any $y_1 \le y < y' \le y_{max}$.

We assume that $f$ is conic and can be evaluated in constant time in the sequel.
The following is the optimization problem to be tackled.

Definition 1 (Finding the best episode pattern according to $f$).

Input Two sets $S, T \subseteq \Sigma^*$ of strings.

Output An episode pattern $\langle v, k \rangle$ that maximizes the value
$f(x_{\langle v,k \rangle}, y_{\langle v,k \rangle})$, where
$x_{\langle v,k \rangle} = |S \cap L^{eps}(\langle v, k \rangle)|$ and
$y_{\langle v,k \rangle} = |T \cap L^{eps}(\langle v, k \rangle)|$.

We remark that the problem is NP-hard, since it is a generalization of finding
the best subsequence pattern [4].

From the conicality of function $f$ and the property of episode patterns, we
can prove the following lemmas.

Lemma 1 ([4]). For any $0 \le x < x' \le x_{max}$ and $0 \le y < y' \le y_{max}$, we have
$f(x, y) \le \max\{f(x', y'), f(x', 0), f(0, y'), f(0, 0)\}$.

Lemma 2. For any two episode patterns $\langle v, l \rangle$ and $\langle w, k \rangle$,
if $v \preceq_{seq} w$ and $l \ge k$ then
$L^{eps}(\langle v, l \rangle) \supseteq L^{eps}(\langle w, k \rangle)$.

By Lemmas 1 and 2, we have the next lemma, which plays a key role in our
algorithm described in Section 4.

Lemma 3. For any two episode patterns $\langle v, l \rangle$ and $\langle w, k \rangle$,
if $v \preceq_{seq} w$ and $l \ge k$ then
$f(x_{\langle w,k \rangle}, y_{\langle w,k \rangle}) \le \max\{f(x_{\langle v,l \rangle}, y_{\langle v,l \rangle}), f(x_{\langle v,l \rangle}, 0), f(0, y_{\langle v,l \rangle}), f(0, 0)\}$.
Fig. 1. EDASG($t$), where $t = aabaababb$. Solid arrows denote the forward edges, and
broken arrows denote the backward edges.



3 Episode Directed Acyclic Subsequence Graphs 

We first analyze the complexity of episode pattern matching: given an episode
pattern $\langle v, k \rangle$ and a string $t$, determine whether
$t \in L^{eps}(\langle v, k \rangle)$ or not. This problem can be answered by filling
up the edit distance table between $v$ and $t$, where only the insertion operation
with cost one is allowed. It takes $O(mn)$ time and space using a standard dynamic
programming method, where $m = |v|$ and $n = |t|$.

For a fixed string, an automata-based approach is useful. We use the Episode
Directed Acyclic Subsequence Graph (EDASG) for a string $t$, which was recently
introduced by Troníček in [8]. A Directed Acyclic Subsequence Graph (DASG) [2]
for a string $t$ is a finite automaton that accepts all subsequences of $t$. An EDASG
is a directed graph that combines two DASGs, for $t$ and for the reversed string $t^R$.
It contains two kinds of edges: forward edges corresponding to DASG($t$), and
backward edges corresponding to DASG($t^R$). As an example, EDASG($aabaababb$) is
shown in Fig. 1. When examining whether an episode pattern $\langle abb, 4 \rangle$
matches $t$ or not, we start from the initial state 0 and arrive at state 6 by
traversing the forward edges spelling $abb$. This means that the shortest prefix of
$t$ that contains $abb$ as a subsequence is $t[0:6] = aabaab$, where $t[i:j]$ denotes
the substring $t_{i+1} \cdots t_j$ of $t$. Moreover, the difference between the state
numbers 6 and 0 corresponds to the length of the matched substring $aabaab$ of $t$,
that is, $6 - 0 = |aabaab|$. Since it exceeds the threshold 4, we move backwards
spelling $bba$ and reach state 1. This means that the shortest suffix of $t[0:6]$
that contains $abb$ as a subsequence is $t[1:6] = abaab$. Since $6 - 1 > 4$, we have
to examine other possibilities. It is not hard to see that we have only to consider
the string $t[2:*]$. Thus we continue the same traversal starting from state 2,
which is the state next to state 1. By a forward traversal spelling $abb$, we reach
state 8, and then a backward traversal spelling $bba$ brings us to state 4. This
time, we have found the matched substring $t[4:8] = abab$, which contains the
subsequence $abb$, and the length $8 - 4 = 4$ satisfies the threshold. Therefore we
report the occurrence and terminate the procedure.

With the use of EDASG($t$), episode pattern matching can be answered
quickly in practice, although the worst case behavior is still $O(mn)$. An
on-line linear-time algorithm for constructing EDASG($t$) for a string
$t \in \Sigma^*$ was proposed in [8].
For strings $v, t \in \Sigma^*$, we define the threshold value $\theta$ of $v$ for $t$ by
$\theta = \min\{k \in \mathcal{N} \mid t \in L^{eps}(\langle v, k \rangle)\}$. If no such
value exists, let $\theta = \infty$. Note that $t \notin L^{eps}(\langle v, k \rangle)$ for
any $k < \theta$, and $t \in L^{eps}(\langle v, k \rangle)$ for any $\theta \le k$. It is not
difficult to see that the EDASGs are useful to compute the threshold value of $v$
for a fixed $t$. We have only to repeat the above forward and backward traversal up
to the end, and return the minimum length of the matched substrings.
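
The repeated forward and backward traversal can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it scans the string directly instead of following the edges of an EDASG, so it shows the control flow but not the speed-up, and the names `threshold` and `matches` are chosen here for illustration only.

```python
import math

def threshold(v, t):
    """Smallest k such that some substring of t of length <= k contains v
    as a subsequence; math.inf if v is not a subsequence of t."""
    if not v:
        return 0
    best = math.inf
    start = 0
    while True:
        # Forward scan: shortest prefix of t[start:] containing v.
        j, pos = 0, start
        while pos < len(t) and j < len(v):
            if t[pos] == v[j]:
                j += 1
            pos += 1
        if j < len(v):
            return best          # no further occurrence
        end = pos                # v is a subsequence of t[start:end]
        # Backward scan: shortest suffix of t[start:end] containing v.
        j, pos = len(v) - 1, end - 1
        while j >= 0:
            if t[pos] == v[j]:
                j -= 1
            pos -= 1
        begin = pos + 1          # matched window is t[begin:end]
        best = min(best, end - begin)
        start = begin + 1        # continue right after the window start

def matches(v, k, t):
    """True if t belongs to the episode language of the pattern (v, k)."""
    return threshold(v, t) <= k

print(threshold("abb", "aabaababb"), matches("abb", 4, "aabaababb"))
```

On the string of Fig. 1 the scan visits the windows of lengths 5 and 4 discussed above and then one more window of length 3 at the end of the string, because it runs to the end rather than stopping at the first window that satisfies the bound.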

From now on, for a set $S$ of strings and a string $v$, we consider the numerical
sequence $\{x_k\}_{k=0}^{\infty}$, where $x_k = |S \cap L^{eps}(\langle v, k \rangle)|$. It
clearly follows from Lemma 2 that the sequence is non-decreasing. Moreover, notice
that $0 \le x_k \le |S|$ for any $k$, and $x_l = x_{l+1} = x_{l+2} = \cdots$, where $l$ is
the length of the longest string in $S$. It implies that $\{x_k\}_{k=0}^{\infty}$ consists
of at most $\min\{|S|, l\}$ distinct values. Hence we can represent
$\{x_k\}_{k=0}^{\infty}$ as a list of pairs $(k, x_k)$ such that $x_{k-1} \ne x_k$. The
length of the list is bounded by $\min\{|S|, l\}$. We call this list a compact
representation of the sequence $\{x_k\}_{k=0}^{\infty}$ (CRS, for short).

We now show how to compute the CRS for each $v$ and a fixed $S$. Observe that $x_k$
increases only at the threshold values of $v$ for some $t \in S$. For each string
$t_i \in S$, we compute the threshold value $\theta_i$ of $v$ for $t_i$, and sort these
threshold values in increasing order. From these sorted values, we can construct
the CRS in linear time. To summarize, if we use counting sort, we can compute the
CRS for $v \in \Sigma^*$ in $O(\|S\| m + |S|) = O(\|S\| m)$ time, where $m = |v|$. We
emphasize that the time complexity of computing the CRS of $\{x_k\}_{k=0}^{\infty}$ is
the same as that of computing $x_k$ for a single $k$ ($0 \le k \le \infty$) by our
method. In the next section, we use a data structure StringSet that supports a
method to compute the CRS for any given string $v$.
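
The construction of the CRS from the threshold values can be sketched as follows. This is a sketch under assumptions, not the StringSet implementation: the threshold is computed by a brute-force scan rather than with EDASGs, Python's built-in sort replaces counting sort, and the names `threshold`, `crs` and `x_at` as well as the sample strings are hypothetical.

```python
import math
from collections import Counter

def threshold(v, t):
    """Length of the shortest substring of t containing v as a subsequence
    (math.inf if there is none); brute force, for illustration only."""
    if not v:
        return 0
    best = math.inf
    for i in range(len(t)):
        j = 0
        for pos in range(i, len(t)):
            if t[pos] == v[j]:
                j += 1
                if j == len(v):
                    best = min(best, pos - i + 1)
                    break
    return best

def crs(v, strings):
    """Compact representation: the pairs (k, x_k) at which x_k increases."""
    counts = Counter(threshold(v, t) for t in strings)
    rep, total = [], 0
    for k in sorted(k for k in counts if k != math.inf):
        total += counts[k]
        rep.append((k, total))
    return rep

def x_at(rep, k):
    """Recover x_k from the compact representation."""
    value = 0
    for threshold_k, count in rep:
        if threshold_k > k:
            break
        value = count
    return value

S = ["aabaababb", "abab", "axxbxb", "bbb"]
rep = crs("abb", S)
print(rep, x_at(rep, 5))
```

The last entry of the list gives the value of $x_k$ for $k = \infty$, which is why it can be read off in constant time.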



4 Algorithm 

The basic structure of the algorithm is similar to that in [4] . 

Fig. 2 shows our algorithm to find a best episode pattern from two given sets
of strings, according to the function $f$. Optionally, we can specify the maximum
length of episode patterns by the parameter $\ell$. Here, we use a data structure
PriorityQueue that supports the following methods.

— bool empty(): return true if the queue is empty.

— void push(string w, double priority): push a string w into the queue with
priority priority.

— (string, double) pop(): pop and return a pair (string, priority), where
priority is the highest in the queue.

At line 16 marked by (*), we can simultaneously compute $k'$ and val by
using the CRSs $x$ and $y$ in $O(|x| + |y|)$ time. By Lemma 3, we can use the value
upperBound, computed at line 20 marked by (**), to prune branches in the search
tree. Note that $x_{\langle v,\infty \rangle}$ and $y_{\langle v,\infty \rangle}$ can be
extracted from $x$ and $y$ in constant time, respectively. The next theorem
guarantees the completeness of the algorithm.

Theorem 1. Let $S$ and $T$ be sets of strings, and $\ell$ be a positive integer. The
algorithm FindBestEpisode($S$, $T$, $\ell$) will return an episode pattern that
maximizes $f(x_{\langle v,k \rangle}, y_{\langle v,k \rangle})$, with
$x_{\langle v,k \rangle} = |S \cap L^{eps}(\langle v, k \rangle)|$ and
$y_{\langle v,k \rangle} = |T \cap L^{eps}(\langle v, k \rangle)|$, where $v$ ranges over
all strings of length at most $\ell$ and $k$ ranges over all integers.

 1  episodePattern FindBestEpisode(StringSet S, T, int ℓ)
 2    string prefix, v;
 3    episodePattern maxEpisode;   /* pair of string and int */
 4    double upperBound = ∞, maxVal = −∞, val;
 5    int k';
 6    CompactRepr x, y;            /* CRS */
 7    PriorityQueue queue;         /* Best First Search */
 8    queue.push("", ∞);
 9    while not queue.empty() do
10      (prefix, upperBound) = queue.pop();
11      if upperBound < maxVal then break;
12      foreach c ∈ Σ do
13        v = prefix + c;          /* string concatenation */
14        x = S.crs(v);
15        y = T.crs(v);
16 (*)    k' = argmax_k {f(x_⟨v,k⟩, y_⟨v,k⟩)} and val = f(x_⟨v,k'⟩, y_⟨v,k'⟩);
17        if val > maxVal then
18          maxVal = val;
19          maxEpisode = ⟨v, k'⟩;
20 (**)   upperBound = max{f(x_⟨v,∞⟩, y_⟨v,∞⟩), f(x_⟨v,∞⟩, 0),
                          f(0, y_⟨v,∞⟩), f(0, 0)};
21        if upperBound > maxVal and |v| < ℓ then
22          queue.push(v, upperBound);
23    return maxEpisode;

Fig. 2. Algorithm FindBestEpisode. In our pseudocode, the break statement
jumps out of the closest enclosing loop.
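
As an illustration, the best-first search of Fig. 2 can be transcribed into Python. This is a sketch under assumptions rather than the authors' implementation: thresholds are computed by brute force instead of with EDASGs, the two CRSs are merged after sorting their keys instead of in a single linear pass, the alphabet is taken to be the set of characters occurring in the input, and the scoring function `f` used in the example is just one conic function chosen for the demonstration.

```python
import heapq
import math
from collections import Counter

def threshold(v, t):
    """Length of the shortest substring of t containing v as a subsequence."""
    best = math.inf
    for i in range(len(t)):
        j = 0
        for pos in range(i, len(t)):
            if t[pos] == v[j]:
                j += 1
                if j == len(v):
                    best = min(best, pos - i + 1)
                    break
    return best

def crs(v, strings):
    """Pairs (k, x_k) at which x_k = |{t : threshold(v, t) <= k}| increases."""
    counts = Counter(threshold(v, t) for t in strings)
    rep, total = [], 0
    for k in sorted(k for k in counts if k != math.inf):
        total += counts[k]
        rep.append((k, total))
    return rep

def best_k(x, y, f):
    """Line 16: scan both CRSs in increasing k, return (k', val) maximizing f."""
    xs, ys = dict(x), dict(y)
    best, xk, yk = (0, f(0, 0)), 0, 0
    for k in sorted(set(xs) | set(ys)):
        xk, yk = xs.get(k, xk), ys.get(k, yk)
        if f(xk, yk) > best[1]:
            best = (k, f(xk, yk))
    return best

def find_best_episode(S, T, max_len, f):
    alphabet = sorted(set("".join(S) + "".join(T)))
    max_val, best = -math.inf, None
    queue = [(-math.inf, "")]            # min-heap on the negated upper bound
    while queue:
        neg_bound, prefix = heapq.heappop(queue)
        if -neg_bound < max_val:
            break
        for c in alphabet:
            v = prefix + c
            x, y = crs(v, S), crs(v, T)
            k, val = best_k(x, y, f)
            if val > max_val:
                max_val, best = val, (v, k)
            x_inf = x[-1][1] if x else 0
            y_inf = y[-1][1] if y else 0
            bound = max(f(x_inf, y_inf), f(x_inf, 0), f(0, y_inf), f(0, 0))
            if bound > max_val and len(v) < max_len:   # line 21: prune
                heapq.heappush(queue, (-bound, v))
    return best, max_val

S = ["abab", "aabb", "xaby"]             # hypothetical positive examples
T = ["axxb", "bbaa", "ayyyb"]            # hypothetical negative examples
f = lambda x, y: abs(x / len(S) - y / len(T))
print(find_best_episode(S, T, 3, f))
```

In this example every string of `S` contains `ab` within a window of length 2 while no string of `T` does, so the pattern with `v = "ab"` and `k = 2` reaches the largest possible value of the chosen `f`.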



5 Conclusion 

We developed a practical algorithm to find the best episode pattern that
separates two given sets of strings. An episode pattern is a generalization of a
subsequence pattern, and the search space of episode patterns is much larger than
that of subsequence patterns. Nevertheless, our algorithm makes it possible to find
the best episode pattern efficiently: the running time will not be much slower than
that for finding subsequence patterns.

It is challenging to apply our approach to find the best pattern in the sense
of pattern languages introduced by Angluin [1], where the related consistency
problems are shown to be very hard [6]. Fujino et al. showed another approach
to find the best proximity pattern [3]. It may be interesting to combine these
approaches into one. We are now in the process of installing our algorithm into
the core of the decision tree generator in the BONSAI system [7].
Abstract. Widespread interest in discovering features and trends in time series
has generated a need for tools that support interactive exploration. This paper
introduces timeboxes: a powerful direct-manipulation metaphor for the specification
of queries over time series data sets. Our TimeSearcher implementation of timeboxes
supports interactive formulation and modification of queries, thus speeding the
process of exploring time series data sets and guiding data mining.



1 Introduction 

Interest in time series data has prompted a substantial body of work in the development
of algorithmic methods for searching temporal data [1,5]. These methods would be
more widely employed if the difficulty of query formulation were reduced. In order to
build understanding of time series data, users need tools that support data exploration
via easy construction of queries and rapid feedback (100ms) [7].

Dynamic queries [2] and related information visualization techniques [4] have
proven useful in meeting these goals. This paper introduces timeboxes: a dynamic query
mechanism for specifying queries on temporal data sets.

2 Related Work 

Data mining research has led to the development of useful techniques for analyzing
time series data, including dynamic time warping [10] and Discrete Fourier Transforms
(DFT) in combination with spatial queries [5]. To date, this work has paid little attention
to query specification or interactive systems. One exception is Agrawal et al.’s Shape
Definition Language, which specifies queries in terms of natural language descriptions
of profiles [1]. Support for progressive refining of queries was addressed by Keogh and
Pazzani, who suggested the use of relevance feedback for results of queries over time
series data [6]. Our work with timeboxes is aimed at developing tools to address issues
of user interaction with these data mining tools.

Existing time series visualization tools generally focus on visualization and
navigation, with relatively little emphasis on querying data sets. Query Sketch is an
innovative query-by-example tool that uses an easily drawn sketch of a time series
profile to retrieve similar profiles, with similarity defined by Euclidean distance [9].
Spotfire’s Array Explorer 3 [8] supports graphically editable queries of temporal
patterns, but the result set is generated by complex metrics in a multidimensional space.



K.P. Jantke and A. Shinohara (Eds.): DS 2001, LNAI 2226, pp. 441–446, 2001.
© Springer-Verlag Berlin Heidelberg 2001







H. Hochheiser and B. Shneiderman 



3 Timeboxes: Interactive Temporal Queries 

Timeboxes are rectangular query regions drawn directly on a two-dimensional display
of temporal data. The extent of the timebox on the time (x) axis specifies the time period
of interest, while the extent on the value (y) axis specifies a constraint on the range of
values of interest in the given time period. More specifically, a timebox that goes between
$(x_{min}, y_{min})$ and $(x_{max}, y_{max})$ indicates that for the time range
$x_{min} \le x \le x_{max}$, the dynamic variable must have a value in the range
$y_{min} \le y \le y_{max}$.

Timeboxes are created, moved, and resized using rectangle manipulation operations
familiar to users of drawing and presentation software. Multiple timeboxes can be
combined to specify conjunctive queries.
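
The semantics of a conjunctive timebox query can be sketched in a few lines of Python. The sketch is illustrative only and is not taken from TimeSearcher: the class `Timebox`, the function `run_query`, the use of integer time indices and the sample data are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Timebox:
    x_min: int      # first time index covered (inclusive)
    x_max: int      # last time index covered (inclusive)
    y_min: float    # lower bound on the value
    y_max: float    # upper bound on the value

    def accepts(self, series):
        """True if every value in the time range lies in [y_min, y_max]."""
        return all(self.y_min <= series[x] <= self.y_max
                   for x in range(self.x_min, self.x_max + 1))

def run_query(data, timeboxes):
    """Names of the entities that satisfy all timeboxes (conjunction)."""
    return [name for name, series in data.items()
            if all(box.accepts(series) for box in timeboxes)]

# Hypothetical data set: one value per time point for each entity.
data = {"A": [10, 20, 30, 40], "B": [10, 50, 30, 5], "C": [12, 22, 90, 41]}
boxes = [Timebox(0, 1, 5, 25), Timebox(3, 3, 35, 45)]
print(run_query(data, boxes))
```

Moving or resizing a timebox corresponds to changing the four fields of one `Timebox` and evaluating the query again.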




Fig. 1. Query containing multiple timeboxes 



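Table 1. Constraints for query shown in Fig. 1

$\forall\, sep \le x \le nov$     5T $\le y \le 160$
$\forall\, dec \le x \le feb$     $124 \le y \le 230$
                                $154 \le y \le 291$
$\forall\, x = apr$              $58 \le y \le 266$
$\forall\, may \le x \le jul$     $46 \le y \le 162$
$\forall\, aug \le x \le sep$     $0 \le y \le 101$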



Fig. 1 provides an example query containing multiple timeboxes. In addition to being 
succinct and easy to create, the timebox version of this query provides a visual picture 
of the constraints that is not apparent in other notations. For example, the query in Fig. 
1 is more easily interpreted than the mathematical expression of the same constraints 
(Table 1), which is cognitively more difficult for users to comprehend. 

4 TimeSearcher 

4.1 Overview 

The main TimeSearcher window is shown in Fig. 2. Entities in the data set are displayed
in a window in the upper left-hand corner of the application. This provides a scrollable
list that can be used to browse through the data. Complete details about the entity
(details-on-demand) can be retrieved by simply clicking on the graph for the desired
entity: this will cause the relevant information to be displayed in the upper right-hand
window (Fig. 2).

Interactive Exploration of Time Series Data




Fig. 2. TimeSearcher, displaying a query with two timeboxes and four of the five records in the 
result set 



4.2 Query Creation and Modification 

Queries are created in the query space in the bottom-left corner of the window. To specify 
a query, users draw a timebox in the desired location. Query processing begins as soon 
as users release the mouse, signifying the completion of the box. No “run” or “query” 
button is necessary because of the rapid update (a few hundred milliseconds). When 
query processing completes, the display in the top half of the application window is 
updated to show those entities that match the query constraints. 

Rapid and dynamic update of the result set display provides prompt feedback
regarding the results of the query. Once the initial query is created, query parameters
can be changed by moving and resizing the timeboxes, either individually or
simultaneously in groups.
4.3 Drag and Drop

Users might be interested in identifying entities that have profiles similar to a given
template or example from the data set. TimeSearcher provides a drag-and-drop mechanism
that can be used to identify items similar to a given example from the data set. The user
can instantiate a query by dragging an item from the data display window and dropping
it onto the query space. The resulting query has a separate timebox for each time point
in the data set (Fig. 3). Once the query is created, the user can modify the timeboxes to
modify the definition of "similar".



Fig. 3. Drag-and-drop query-by-example



4.4 Envelopes for Overviews 

TimeSearcher uses envelopes to provide overview displays to help users make sense of 
large data sets [4,7]. Optionally shown in the background of the query window, the data 
envelope is a contour that follows the extreme values of the query attribute at each point 
in time, thus displaying the range of values that may be queried. When the user executes 
a query, the data envelope is extended by a query envelope - an overlay that outlines 
extreme values of the entities in the result set (Fig. 4). This display provides users with 
a graphic summary of the relationship between the result set and the data set as a whole. 
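
The two envelopes can be computed with a per-time-point minimum and maximum. The following sketch assumes that all series have the same length and uses a hypothetical function name `envelope` and hypothetical sample data; it says nothing about how TimeSearcher draws the contours.

```python
def envelope(series_list):
    """Per-time-point (min, max) over a list of equally long series."""
    return [(min(vals), max(vals)) for vals in zip(*series_list)]

# Hypothetical data set and result set of a query.
data = {"A": [10, 20, 30, 40], "B": [10, 50, 30, 5], "C": [12, 22, 90, 41]}
result_set = ["A", "C"]

data_envelope = envelope(list(data.values()))
query_envelope = envelope([data[name] for name in result_set])
print(data_envelope)
print(query_envelope)
```

Because the result set is a subset of the data set, the query envelope always lies within the data envelope.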
Fig. 4. Data and query envelopes for a query with two timeboxes



5 Software 

TimeSearcher was implemented in Java 2, using the Swing toolkit. Drawing and
scene-graph control in the data and query displays, along with functionality for moving
and rescaling timeboxes, is provided by Jazz [3]. Timeboxes, graphs of each item, and
query and data envelopes are implemented as Jazz widgets.

Orthogonal range trees are used to index the data, with each timebox acting as an
orthogonal range query. In this model, a timebox of width $w$ is an orthogonal range
query, and an entity from the data set must have $w$ points that fall within the query
range to be included in the result set for the query.



6 Discussion and Future Work 

TimeSearcher uses an “overview-first” [7] approach to the exploration of time series
data. The data and query envelopes, together with the linear list of graphed elements,
provide the necessary overview. Each timebox is a new filter that restricts the data set
resulting from the query formed by the pre-existing timeboxes. Query processing on
mouse release follows a model familiar to users of modern GUIs, whereby a mouse
release is treated as completion of user input.

Several extensions to the timebox model might increase the range of queries that 
can be expressed. Queries involving events of fixed duration occurring at any point in 
time, events that are separated by minimum gaps in time, disjunctions and negations, 
trends involving relative changes (“increase of more than 50% within a given period”) 
and multiple time-dependent attributes might be of interest. 

Further gains in efficiency might be realized by using timeboxes to specify queries to 
be evaluated with existing data mining algorithms such as those described by Faloutsos, 
et al. [5]. In this model, TimeSearcher might be used to interactively search subsets of 
a larger data set, in order to refine queries that might be executed against the entire data 
set, using the more expensive data mining algorithms. 
7 Conclusions

TimeSearcher uses dynamic queries, overviews, and other information visualization 
techniques that have proven useful in a variety of other domains [2,4,7] to support 
interactive examination of time series data. Timeboxes represent an extension of the 
dynamic query idea to include widgets that query multiple dimensions simultaneously, 
as each timebox specifies constraints over two dimensions. 

The incorporation of data mining algorithms into systems that support exploration
and interactive knowledge discovery is the next step in making data mining more
accessible to a wider range of users and problem domains. A more diverse user population
will also stimulate more research, as these users generate questions and problems
involving further algorithmic challenges.

The utility of timeboxes will be a function of the usability of the interface, particularly
in comparison with alternative approaches. Empirical studies and heuristic evaluations
are needed to clarify the benefits and drawbacks of timeboxes, while suggesting
additional interface improvements.

Acknowledgments. Thanks to Martin Wattenberg for providing stock price datasets, 
and to Eric Baehrecke and Hyunmo Kang for valuable feedback. The first author was 
supported by a fellowship from America Online. 
Abstract. We consider the problem of pruning a given set of if-then rules, such
that the support of the pruned rule set is not much less than the support of the 
given rule set. An empirical measure of similarity between two rules is 
introduced. This similarity measure is proportional to the degree of overlap 
between the support sets of the two rules. Using this similarity measure, we 
cluster the given rule set via the complete linkage algorithm. Rules within a 
cluster are approximate substitutes for each other and, as such, they can be 
replaced by a single rule, which is chosen to be the rule whose individual 
support value is the largest in the cluster. The pruning procedure is 
demonstrated on a set of rules generated from a marketing data set. 



1 Introduction 

If-then rules are arguably the most popular product of data mining in business. They 
can be generated in an unsupervised manner as association rules [1], [2] or in a 
supervised manner as classification rules [3]. Insights discovered from such rules are 
used in database marketing to effectively target products and promotions, in 
personalization technologies such as recommender systems to increase cross-selling 
and up-selling, and in designing better shelf layouts in retail outlets [4]. 

Several methods are available for generating association and classification rules. 
Regardless of the method of production, the number of rules generated is typically too 
large to allow for direct managerial action. Rule mining systems sift through the 
initial set of rules extracted and seek to retain only a subset of useful rules. 

In this paper, we consider the problem of pruning a given set of rules such that 
there is only a minimal loss of support with respect to a given data set. Our pruning 
procedure uses cluster analysis to remove redundant rules.

Our approach is characterized by two important features. First, unlike much of the 
research on rules which focuses on performance measures of individual rules (such as 
rule support, rule coverage, rule confidence, and rule simplicity), we consider a 
performance measure for the rule set, namely, set support. This is an important 
distinction because aggregating rules that are individually good does not guarantee
good performance properties for the aggregated rule set. 

Second, our approach employs cluster analysis in a very different fashion than 
previous uses of clustering in rule pruning. For example, in [5], clustering is used to 
combine rules based on similarities in their logical construction. On the other hand, 
our use of clustering is based on similarities in the support sets of the rules being 
clustered. If two rules point to the same class (i.e., same consequent value), we 
consider them to be substitutes for each other if they act on the same (or similar) sets 
of data records, regardless of the similarity or dissimilarity in their logical 
construction. Thus, our approach is strongly empirically driven. 

Our approach can be described as a three-step procedure. The first step is to 
partition the initial set of rules into groups based on the predicted consequent. In the 
next step, we partition each such rule group into distinct clusters such that antecedents 
of rules within a given cluster point to similar record sets. This requires the definition 
of a similarity measure for rules, which we provide below. Since the record sets of 
their antecedents are similar and the record sets of their consequents are identical, it 
follows that the support sets of the various rules within a cluster are close to each 
other. Consequently, such rules can be viewed as substitutes for each other. In the 
third step, rules within each cluster are ranked on the basis of their support values and 
only the top rule is retained. The final rule set is obtained by aggregating the top rules 
from the various clusters. By construction, such a rule set will approximately span 
the support set of the initial set of rules, while being of a much smaller size. 



2 Methodology 

Let $D$ be the data set of records on which the rules are to be applied. Let the $i$th rule
be of the form $R_i : X_i \Rightarrow Y_i$, where $X_i$ is the antecedent and $Y_i$ is the
consequent. Let $A_i$ denote the record set of the antecedent $X_i$ (i.e., the subset of
records in $D$ that satisfy $X_i$), and let $C_i$ denote the record set of the consequent
$Y_i$. Let $S_i = A_i \cap C_i$ denote the support set of $R_i$, i.e., the set of records in
$D$ where the rule $R_i$ applies and is true. Let $\#(S)$ denote the cardinality of any set
$S$. The support value of a rule $R_i$ is $\#(S_i)/\#(D)$. The coverage of $R_i$ is
$\#(A_i)/\#(D)$. The confidence of $R_i$ is $\#(S_i)/\#(A_i)$.

Consider a rule set $R$ containing $n$ rules $R_1, R_2, \ldots, R_n$ such that they all have
a common consequent. The support set of rule set $R$ is
$S = S_1 \cup S_2 \cup \cdots \cup S_n$ and the support value of rule set $R$ is
$\#(S)/\#(D)$.

Consider two rules $R_i$ and $R_j$. We define sim($i$, $j$), a bounded measure of the
similarity between $R_i$ and $R_j$, as

$$\mathrm{sim}(i, j) = \#(S_i \cap S_j) / \min\{\#(S_i), \#(S_j)\}.$$



If two rules have different consequents (i.e., different predicted classes), then their 
support sets will be disjoint and therefore our similarity measure will be 0. If two 
rules have the same consequent, then our similarity measure indicates the degree of 
overlap between the record sets of their antecedents. If the record sets of their 
antecedents are disjoint, then their similarity will be 0. At the other extreme, if the 
record set of one of the antecedents is a subset of the record set of the other
antecedent, then their similarity will be 1.
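
The similarity measure and its two extreme cases can be checked with a short sketch. Support sets are represented here as Python sets of record identifiers, which is an assumption of this illustration; returning 0 for an empty support set is likewise a convention chosen here, since the measure is undefined in that case.

```python
def sim(s_i, s_j):
    """Overlap of two support sets, scaled by the size of the smaller one."""
    if not s_i or not s_j:
        return 0.0  # convention of this sketch for empty support sets
    return len(s_i & s_j) / min(len(s_i), len(s_j))


partial = sim({1, 2, 3}, {2, 3, 4, 5})   # two shared records, smaller set has three
nested = sim({2, 3}, {1, 2, 3, 4})       # one support set contained in the other
disjoint = sim({1, 2}, {3, 4})           # e.g. rules with different consequents
```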

Given a set of rules, we first sort them on the basis of their consequents. Typically, 
the consequent has two possible values (buy, no-buy), although several values must 
be considered when predicting the purchase of multiple products. Consider the set of 
rules having a particular value of the consequent. For such a set, we may still have 
several hundred unique rules. Since these rules cannot differ in their consequent, they 
differ only in their antecedent. However, there may be significant redundancies when 
there is a large overlap in the record sets of the antecedents. Our similarity measure 
captures the degree of this redundancy. 

For a set of n rules having a common value of the consequent, we construct an n × n
similarity matrix based on our measure. Next, we use cluster analysis [6] based on 
this similarity matrix to partition the rule set into disjoint clusters. This can be 
implemented via a hierarchical agglomerative technique such as one of the various 
linkage algorithms. In particular, the complete linkage algorithm is known to produce 
compact clusters. 

Selecting the number of clusters to be formed is a sensitive process. Since 
hierarchical clustering methods indicate the similarity level at which various clusters 
are merged, we can control the final number of clusters by specifying a threshold 
similarity level. For example, if this threshold level is set at 0.3, then the number of 
clusters formed is such that the similarity between any two clusters is at most 0.3. 
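
A minimal sketch of complete-linkage clustering driven by a similarity threshold is given below. It is a plain-Python illustration under stated assumptions, not the software used in the study: the function name and the toy matrix are hypothetical, and the "scaled" threshold mentioned in Section 3 is not modelled. Under complete linkage, the similarity between two clusters is taken as the smallest pairwise similarity between their members, and merging stops once no pair of clusters exceeds the threshold.

```python
def complete_linkage(sim_matrix, threshold):
    """Agglomerative clustering on a similarity matrix.

    Repeatedly merges the two most similar clusters, where cluster
    similarity is the minimum pairwise similarity (complete linkage),
    and stops when the best available merge is at or below `threshold`.
    """
    clusters = [[i] for i in range(len(sim_matrix))]

    def linkage(a, b):
        return min(sim_matrix[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, (p, q) = max(
            (linkage(clusters[p], clusters[q]), (p, q))
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if best <= threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical similarity matrix for four rules.
S = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
clusters = complete_linkage(S, threshold=0.3)
```

With the threshold at 0.3, rules 0 and 1 end up together, as do rules 2 and 3, and the similarity between the two resulting clusters is below the threshold.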

Rules within a cluster are guaranteed to be similar to each other in the sense that 
the record sets satisfying their antecedents are significantly overlapping. (They also 
have a common consequent by construction.) As such, they can be replaced by a 
single rule from the cluster, or more generally, by a smaller number of rules than the 
cluster size. 

The next step is to select a single rule (or a small number of rules) from each 
cluster. To accomplish this, we sort the rules within each cluster on the basis of their 
support measure and retain only the top few rules from each cluster. 

Finally, the retained rules from each cluster are aggregated into a final rule set. 
Since each cluster corresponds to a distinct region in the data space (or the space of 
customer attributes) with respect to the antecedents, we have retained at least one rule 
for each populated part of the space of customer attributes. Thus, the space of 
customer attributes is effectively spanned by the smaller set of final retained rules. 
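
The selection and aggregation steps can be sketched as follows. The function name, the `keep` parameter, and the toy support values are assumptions introduced for illustration.

```python
def prune(clusters, support, keep=1):
    """Retain the `keep` highest-support rules of every cluster.

    `clusters` is a list of lists of rule identifiers and `support`
    maps a rule identifier to its support value.
    """
    retained = []
    for cluster in clusters:
        ranked = sorted(cluster, key=lambda rule: support[rule], reverse=True)
        retained.extend(ranked[:keep])
    return retained


# Hypothetical clusters and support values for four rules.
example_clusters = [[0, 1], [2, 3]]
example_support = {0: 0.02, 1: 0.05, 2: 0.04, 3: 0.01}
final_rules = prune(example_clusters, example_support)
```

Setting `keep` above 1 corresponds to retaining "the top few rules" from each cluster rather than a single representative.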



3 Example 

We demonstrate our methodology to reduce the number of classification rules 
obtained from a large marketing data set. The goal of all rules produced from this 
data set is to determine whether a given customer will re-use a particular product or 
not. The data set consists of 438,808 records, of which 21.63% were re-users. A 
training set of 60,000 records was created by random sampling. Using the popular 
C4.5 program [7], an initial rule set of 90 rules was extracted from the training set. 
All rules indicate re-use by the customer. A similarity matrix was constructed for 
these 90 rules using the similarity measure defined in Section 2. Complete linkage
clustering was performed on this similarity matrix with a scaled threshold similarity
level of 0.5. This resulted in 69 clusters, of which 57 clusters contained a single rule,
while the other 12 contained more than one rule. The rules in each of the 69 clusters
were sorted on the basis of their individual support values, and the rule with the
maximum support was retained from each cluster.

Table 1. Performance measures of successively pruned rule sets

Size of     Support             Coverage            Confidence
Rule Set    Training  Test      Training  Test      Training  Test
90          0.0775    0.0762    0.1586    0.1587    0.4878    0.4804
69          0.0769    0.0757    0.1571    0.1569    0.4896    0.4823
45          0.0736    0.0729    0.1471    0.1470    0.5004    0.4960
35          0.0711    0.0706    0.1403    0.1402    0.5071    0.5017
25          0.0687    0.0680    0.1337    0.1339    0.5137    0.5083
20          0.0678    0.0671    0.1309    0.1311    0.5180    0.5116
12          0.0621    0.0616    0.1155    0.1157    0.5378    0.5324

These 69 rules were then further pruned by dropping rules with low values of 
support. In turn, we looked at nested rule sets containing 45, 35, 25, 20, and 12 rules. 
For each rule set, we computed the support, coverage, and confidence values on the 
training set of 60,000 records, as well as on a test set of 10,000 records randomly 
sampled from the original data set. These values are displayed in Table 1. The 
performance measures for the test set are charted in Figure 1.

We observe that the numbers for the test set show a similar pattern to those for the 
training set. In comparing the initial set of 90 rules with the post-clustering rule set of 
69 rules, we see that the pruning of 21 rules (23% of the initial rule set) is 
accompanied by only a small drop in the value of support. Subsequent rows in Table 
1 show that while further pruning brings the benefit of a smaller rule set, it is 
accompanied by progressively larger drops in support. 

As expected, the coverage of the retained rule set decreases with each pruning step. 
However, the confidence of the retained rule set increases with each pruning step, 
showing that successive retained sets consist of smaller numbers of more accurate 
rules. 
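
The trends described above can be re-derived from the test-set columns of Table 1. The sketch below assumes that rule-set coverage and confidence are defined analogously to the single-rule case, so that support is approximately coverage times confidence; the paper does not state this explicitly, so the check is only a plausibility test on the reported numbers. The variable and function names are hypothetical.

```python
# Test-set columns of Table 1: (rule-set size, support, coverage, confidence).
TABLE_1_TEST = [
    (90, 0.0762, 0.1587, 0.4804),
    (69, 0.0757, 0.1569, 0.4823),
    (45, 0.0729, 0.1470, 0.4960),
    (35, 0.0706, 0.1402, 0.5017),
    (25, 0.0680, 0.1339, 0.5083),
    (20, 0.0671, 0.1311, 0.5116),
    (12, 0.0616, 0.1157, 0.5324),
]


def consistency_gaps(rows):
    """Absolute difference between support and coverage * confidence."""
    return [abs(sup - cov * conf) for _, sup, cov, conf in rows]


def relative_support_drops(rows):
    """Support lost relative to the initial (first) rule set."""
    base = rows[0][1]
    return {size: (base - sup) / base for size, sup, _, _ in rows[1:]}


gaps = consistency_gaps(TABLE_1_TEST)
drops = relative_support_drops(TABLE_1_TEST)
```

The relative drops make the trade-off explicit: the first pruning step (90 to 69 rules) costs well under one percent of the initial support, and each further step costs more.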



4 Conclusions 

We have presented a method for rule pruning that is based on cluster analysis. 
Instead of looking at the logical construction of various rules to look for similarities, 
we look at the support sets of the rules and look for the degree of overlap. A 
similarity measure that quantifies this degree of overlap is defined. When the rules 
are clustered based on this similarity measure, rules within a cluster are relatively 
redundant with respect to each other since they are used to classify the same set of 
data records. Consequently, each cluster can be represented by a single rule without 
much loss of support. This results in a pruned rule set whose size is much smaller 
than the original rule set but without any significant decrease in support. 
Fig. 1. Performance measures on test data set for different sizes of rule sets. Support and
coverage follow scale on left, confidence follows scale on right.
Abstract. This paper investigates both the role of fine-grained historical cases
in developing computational models of techno-scientific thinking and the impact 
of such models for supporting information search and further inventions and 
discoveries. In particular, we investigate Alexander Graham Bell’s invention of 
the telephone and we propose a computational model to explain its essential aspects.
We further derive lessons about how such a model can be used to build
human-computer interaction systems that augment the intelligence of users in- 
volved in information search. We conclude that historical data can be used to 
advance cognitive and computational theories of techno-scientific thinking and 
to build better human-information systems. 



1 From the Telephone Invention to WWW Information Search 

This paper investigates the lessons learned from developing computational models of 
techno-scientific reasoning based on fine-grained historical cases of invention. We 
highlight the applicability of such lessons for building intelligent systems to assist 
users in information search, since information search plays a significant role for pur- 
suing new inventions and discoveries. We have chosen Alexander Graham Bell’s 
invention of the telephone as our case study because it is one of the best documented 
and analyzed historical cases of invention (e.g., Bruce, 1972; Bell, 1908).

Our investigation of Bell’s inventions (Gorman, 1998; Simina et al., 1998) identified
a series of general criteria about scientific discovery and invention, which can be
summarized as follows:

1. Invention and discovery depend on establishing that a problem is significant 
enough to be labeled an important achievement. 

2. Invention and discovery depend on transforming that problem into a form that 
suggests a promising path to solution. This includes locating and transforming the 
necessary mechanical representations. 

3. Invention and discovery depend on a combination of flexibility and stubbornness, 
depending on the cognitive styles and career trajectories of the inventors involved 
and on how they represent the problem. 
4. Communication is part of the invention and discovery process.

5. Successful inventors and scientists often pursue networks of enterprises (open
problems) that may interact with one another. Such interactions are a major source of
creativity.

The above general criteria also provide a framework for understanding other cases
of invention and discovery. While Gorman (1998) shows how these criteria apply in
the case of several scientists (such as Kepler, Lavoisier and Krebs), Simina (1999)
proposes a computational model (ALEC) that simulates essential aspects of the
telephone invention and takes into account all the above criteria. The initial
computational model (Simina et al., 1998) was based on an in-depth analysis of Bell’s
inventions using case-based reasoning as an investigation tool. This analysis helped us to
identify limitations of existing models of techno-scientific thinking (e.g., the inability
to reason about several goals in parallel by taking advantage of opportunistic
interactions among them). The initial version of ALEC addressed these limitations. Next, we
used Gorman’s criteria as additional constraints for refining our model. Simulating the
invention of the telephone with ALEC helped us to refine Gorman’s generalizations
and their interpretation at a computational level. In turn, the resulting model can be
used to better understand techno-scientific reasoning.

In the end, our computational model characterizes the long-term work of a creative 
reasoner in terms of partially-independent enterprise goals, high-level goals pursued 
in parallel that interact synergistically and evolve incrementally. While we agree that a 
reasoner explicitly addresses only one (current) goal at a time (e.g., Simon, 1989), we 
claim that the other (background) goals pursued in parallel, along with their (partial) 
solutions, provide a reasoning context for advancing solutions for a current goal. 
Moreover, if a reasoner is part of a team, then his reasoning context may include goals 
of other team members, and creative reasoning becomes distributed across the whole 
team. 

Since our model highlights the role of information while pursuing enterprise goals, 
an interesting issue is whether it can be reused to support information search for future 
enterprises (e.g., invention or discovery). In what follows, we briefly present ALEC 
and then we describe Smart Agenda, a computational model for augmenting the intel- 
ligence of users involved in information search. Our objective is to investigate cogni- 
tive and computational models for supporting opportunistic information search on the 
WWW (and/or other large databases) by taking advantage of the lessons learned from 
Bell’s case study. 

2 ALEC 

ALEC’s functional architecture is presented in Fig. 1. According to our previous 
analysis (Simina et al. 1998, Simina 1999), a reasoner may internally pose its own 
enterprise goals (i.e., Enterprise Posing), or it may be interested in adopting exter- 
nally-posed enterprise goals. After a new enterprise is posed, an Enterprise Adoption 
process must identify which of these posed enterprises are worth pursuing in the cur- 
rent context as active enterprises, which have to be postponed (by suspending them in 
memory), and which should be ignored. A reasoner addresses an enterprise by
concurrently evolving its specification and a pool of alternative solutions relying on his
previous experience. This involves three processes: (1) Evolve Specification, (2) Evaluate
and Critique, and (3) Evolve Solutions, which are together responsible for Enterprise
Processing.



Fig. 1. ALEC: Functional Architecture (diagram blocks: Enterprise Generation, Enterprise Processing)

Each of the above processes makes inferences based on knowledge retrieved from 
the Current Context (knowledge and goals accessed recently) and/or from Previous 
Experience. The Retrieval* algorithm is responsible for performing a fine-grain con- 
tent-based retrieval from the Current Context, which simulates priming effects. If 
Current Context retrieval is unsuccessful, Retrieval* performs an index-based retrieval
from the Previous Experience repository (Kolodner, 1993). The index-based retrieval 
simulates free recall from long-term memory. To simulate opportunistic reasoning, 
suspended enterprise-goals can be indexed in the Previous Experience repository in 
terms of the missing knowledge that prevents finding solutions for the suspended 
goals. A detailed computational model can be found in Simina (1999). 
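
The two-stage behaviour described for Retrieval* can be sketched as follows. This is only an illustration under strong assumptions and not the algorithm specified in Simina (1999): the function name, the feature-set representation, the overlap score, and the dictionary used as the Previous Experience index are all hypothetical.

```python
def retrieval_star(cue, current_context, previous_experience, min_overlap=1):
    """Try a content-based match in the current context first,
    then fall back to an index-based lookup in previous experience."""
    cue = set(cue)

    # Stage 1: fine-grained content-based match against recently accessed
    # items (a crude stand-in for priming effects).
    best = max(
        current_context,
        key=lambda item: len(cue & item["features"]),
        default=None,
    )
    if best is not None and len(cue & best["features"]) >= min_overlap:
        return ("current context", best["name"])

    # Stage 2: index-based lookup (a stand-in for free recall). Suspended
    # goals are indexed by the knowledge whose absence blocked them.
    for key in sorted(cue):
        if key in previous_experience:
            return ("previous experience", previous_experience[key])
    return None


# Hypothetical memory contents.
current_context = [
    {"name": "recent note A", "features": {"x", "y"}},
    {"name": "recent note B", "features": {"y", "z", "w"}},
]
previous_experience = {"k": "suspended goal G (blocked by missing knowledge k)"}

hit_primed = retrieval_star({"z", "w"}, current_context, previous_experience)
hit_recalled = retrieval_star({"k"}, current_context, previous_experience)
hit_none = retrieval_star({"q"}, current_context, previous_experience)
```

Indexing a suspended goal under its missing knowledge means that encountering that knowledge later retrieves the goal, which is the opportunistic behaviour the text describes.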

The Enterprise Processing processes are directly relevant for information search. 
The next section presents Smart Agenda, a tool for supporting information search, 
which takes advantage of ALEC’s opportunistic reasoning architecture. 



3 Smart Agenda 

Information search became a significant issue with the widespread reliance on the 
World Wide Web as an information delivery medium. Since existing deployed tech- 
nologies for information classification and search (e.g., portals such as Yahoo) fail to 
provide the leverage needed to transform the massive amounts of new information into 
knowledge (Furuta & Papakonstantinou, 1999), information search remains a significant
method to access information on the WWW. Unfortunately, traditional search
engines do not take into account many implicit aspects of the search process, such as
the search context (e.g., the user’s profile, the current problem that he is investigating,
and his other suspended goals), and consequently in many cases search engines fail to
provide the right information at the right time. In such cases, people may take
advantage of their previous experience to reformulate the initial query or to suspend the
current information goal and resume it later, when relevant search knowledge becomes
available. Intelligent search agents should be able to manifest similar adaptive behavior.

If previous search experience is essential to evolve the information goals of a user
(e.g., queries using a standard search engine), then the issue is how to capture this
previous search experience and how to reuse it. Since the WWW is also a repository
for a growing number of portals that manually encode search experience in limited
domains, one research hypothesis is that this manually encoded knowledge can be
automatically captured by web agents, using methods similar to those described by
Heflin & Hendler (2001), and reused to guide the evolution of the user’s information
goals. Such an approach takes advantage of the information that is already available
on the web and transforms it into search knowledge.

But there is no guarantee that the prerequisite knowledge for evolving an information
goal is always available. In such situations people rely on opportunistic reasoning to
suspend the current information goal and resume it later when information about how
to evolve it becomes available (e.g., Simina 1999). The issue is how to discover that
some current piece of information is relevant for a suspended goal. Then, a new search
pattern connecting the suspended goal and the discovered information can be learned
for future reuse. The second hypothesis is that opportunistic reasoning (see Figure 1)
can help a reasoner to discover new search knowledge when traditional search methods
fail. Besides providing a cognitive framework for understanding and supporting complex
information retrieval, opportunistic information search provides a foundation for
integrating existing tools developed for information retrieval.



Fig. 2. Opportunistic information search (diagram labels: Document index, Current fragment, Dynamic index, Indexed documents, Suspended goals)
4 Related Work

Existing deployed tools for information retrieval include search engines, manually 
structured indices and knowledge bases (e.g., portals like Yahoo) and just-in-time 
information retrieval engines (JITIR; Rhodes & Maes, 2000). Each method has its 
own inherent limitations. Search engines may retrieve too many documents, most of 
them irrelevant to the current problem. The issue is how to reformulate the query to 
retrieve documents relevant to the current problem. Manually structured knowledge 
bases are designed to address common queries well, by suggesting some categories at
every step of incremental query reformulation (e.g., Yahoo, Ask Jeeves). The issue is 
how to select a category for an unusual query. JITIR is a proactive technique that 
automatically retrieves documents relevant to the current task of the user. The user 
does not have to explicitly articulate his goal, and the JITIR engine may retrieve many 
documents that may be relevant for the current task or environment, but not necessar- 
ily for the implicit goal of the user. 

One way to address the above limitations is to identify how people perform infor- 
mation search. Previous experience plays an important role in successful information 
search, and some researchers proposed case-based systems that capture the expertise
associated with information search (e.g., Jaczynski & Trousse, 1998; Leake et al.,
2000). Such systems capture and store past navigation patterns of a group of users, and 
reuse past patterns that (partially) match the current search context to suggest what 
documents to examine next. However, in many cases it is not necessary to build navi- 
gation pattern libraries from scratch. Many existing web portals implicitly contain 
navigation knowledge and the only issues are: (1) to identify such portals and (2) to 
automatically extract navigation patterns and store them in a case library. 

But what kind of support can be provided to a user when previous libraries of navi- 
gation patterns do not exist? People rely on opportunistic reasoning in such situations, 
i.e., they suspend the current information goal and they resume it when knowledge 
relevant to the suspended information goal becomes available. Previous research in 
opportunistic reasoning did not investigate information search. Just-in-time IR (Rho- 
des & Maes 2000) addresses only the issue of retrieving documents relevant for the 
current (implicit) information goal. A computational tool can keep track of the (sus- 
pended) information goals of a user and check opportunistically if the current docu- 
ment, relevant for the current information goal, is also relevant for any suspended 
goals. Currently we are experimenting with a prototype of Smart Agenda built on top
of the software described in Rhodes & Maes (2000). 
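
The opportunistic check described above, in which the document retrieved for the current goal is also tested against every suspended goal, can be sketched as follows. The keyword-overlap relevance test, the function name, and the example goals are assumptions of this illustration; they do not describe the Smart Agenda prototype or the software of Rhodes & Maes (2000).

```python
def relevant_suspended_goals(document_text, suspended_goals, min_hits=2):
    """Return the suspended goals whose keywords occur in the document.

    `suspended_goals` maps a goal label to the keywords describing the
    knowledge that was missing when the goal was suspended.
    """
    words = set(document_text.lower().split())
    resumable = []
    for goal, keywords in suspended_goals.items():
        hits = len(words & {k.lower() for k in keywords})
        if hits >= min_hits:
            resumable.append(goal)
    return resumable


# Hypothetical agenda of suspended information goals.
agenda = {
    "rule pruning": ["clustering", "rules", "pruning"],
    "telephone history": ["telephone", "bell"],
}
current_document = "Complete linkage clustering produces compact clusters of association rules"
to_resume = relevant_suspended_goals(current_document, agenda)
```

A goal returned by such a check is a candidate for reactivation, and the pair (goal, document) is the kind of search pattern that could be stored for later reuse.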



5 Conclusions 

Cognitive frameworks and methods applied to fine-grained historical case studies can 
add rigor to those analyses. Gruber (1974) proposed the framework of network of 
enterprises to explain Darwin’s creativity. Gorman added his general criteria to Gru- 
ber’s and showed how Gruber’s analysis applies across invention and discovery cases 
(Gorman, 1998). But only with the addition of a computational model have we been 
able to begin to understand the processing underlying goal suspension and the
conditions for their reactivation (Simina 1999). This insight can now be applied both to
understand fine-grained details of other historical case studies and to build intelligence
augmentation tools (e.g., by supporting information search) to afford future inventions.
Abstract. Various research fields in science and technology are now
accumulating large amounts of data in databases, using recently devel- 
oped computer controlled efficient data-acquisition tools for measure- 
ment, analysis, and observation. Researchers believe that such a huge 
extensive data accumulation in databases will allow them to simulate 
various physical, chemical, and/or biological phenomena on computers 
without carrying out any time-consuming and/or expensive real experi- 
ments. Information visualization for DB-based simulation requires each 
visualized record to work as an interactive object. Current information 
visualization systems visualize records without materializing them as in- 
teractive objects. Researchers in these fields develop their individual or 
community mental models on their target phenomena, and often like to 
visualize information based on their own mental models. We will propose 
in this paper a generic framework for developing virtual materialization 
of database records based on the component-ware architecture Intelli- 
gentBox that we developed in 1995 for 3D applications. This framework 
provides visual interactive components for (1) accessing databases, (2) 
specifying and modifying database queries, (3) defining an interactive 
3D object as a template to materialize each record in a virtual space,
and (4) defining a virtual space and its coordinate system for the infor- 
mation materialization. These components are represented as boxes, i.e., 
components in the IntelligentBox architecture. 



1 Introduction 

Recently, extensive application of information technologies in various social ac- 
tivities such as production, distribution, sales, finance, communication, trans- 
portation, education, and welfare has enabled us to file large amounts of personal 
records in these social activities and to store them in databases. Although their 
use should not violate individual’s privacy, they contain various useful knowledge 
resources that may not violate any privacy. Information visualization technolo- 
gies as well as data mining technologies aim to support people to extract such 
knowledge resources. Most of the current information visualization systems pro- 
pose various specific visualization schemes, assuming typical application fields 
and typical analysis methods in these fields. Some information visualization systems
partially allow us to interactively define visualization schemes. These include
Tioga 2 (Tioga DataSplash) [1], Visage [2], and DEVise [3]. They only allow
us to make a selection out of a priori provided libraries of visualization schemes.
Various research fields in science and technology are now accumulating large
amounts of data in databases, using recently developed computer controlled 
efficient data-acquisition tools for measurement, analysis, and observation. Re- 
searchers believe that such a huge extensive data accumulation in databases will 
allow them to simulate various physical, chemical, and/or biological phenom- 
ena on computers without carrying out any time-consuming and/or expensive 
real experiments. In this paper, we call such a new way of research in science
’data-based science’. Information visualization will no doubt be one of
the most powerful tools in data-based science. Current information visualization 
technologies, however, do not satisfy the requirements in data-based science. 

Information visualization for DB-based simulation requires each visualized 
record to work as an interactive object. It should be easy enough for these 
researchers, who are not necessarily computer experts, to define the functionality 
of each visualized record as well as the spatial record arrangement. Current 
information visualization systems visualize records without materializing them 
as interactive objects. Instead of information visualization systems, we need 
an information materialization framework that allows us to materialize each 
record as an interactive visual object in a virtual space. Furthermore, researchers 
in these fields develop their individual or community mental models on their 
target phenomena, and often like to visualize information based on their own 
mental models. We need to provide these researchers with a new visualization 
environment in which they can easily define their own visualization schemes as 
well as various query conditions. 

We will propose in this paper a generic framework for developing virtual 
materialization of database records based on the component-ware architecture 
IntelligentBox that we developed in 1995 for 3D applications. This framework 
provides visual interactive components for (1) accessing databases, (2) specifying 
and modifying database queries, (3) defining an interactive 3D object as a template
to materialize each record in a virtual space, and (4) defining a virtual space
and its coordinate system for the information materialization. These components 
are represented as boxes, i.e., components in the IntelligentBox architecture. 



2 IntelligentBox Architecture

IntelligentBox is a component-ware system for developing 3D interactive 
applications [4]. It is a 3D extension of a 2D meme media architecture 
IntelligentPad[5] [6]. It calls components boxes. Boxes may have arbitrary in- 
ternal functions as well as arbitrary 3D visual display functions. IntelligentBox 
provides a dynamic functional composition mechanism that enables us to geo- 
metrically and functionally combine 3D objects through direct manipulation on 
the screen to compose a complex 3D object. Only primitive component boxes 
need to be programmed by their developers. Composite boxes are also simply 
referred to as boxes unless this causes any confusion. Each box is logically mod- 
eled as a list of slots, each of which can be accessed by either a ’set’ message or a 
’gimme’ message. Corresponding to each slot, a box has two procedures that are 
respectively invoked by a ’set’ and a ’gimme’ message. In addition to these slots,
each box has its properties such as its dimension, its orientation, and its angle. 
A box may define some of these properties as its slots, which allows other boxes 
to change those properties through their slot connection linkages to this box. A 
RotationBox, for example, has a cylinder shape, and rotates itself corresponding 
to user operations. It has a slot named #ratio whose value changes from 0.0 to 
1.0 in proportion to its rotation angle. It rotates when the #ratio slot is set with 
a new value. Its direct manipulation changes not only its rotation angle but also 
its #ratio slot value. 

A box can be connected to a single slot of no more than one other box. The 
former becomes a child box of the latter, while the latter is called a parent box 
of the former. Each child box is managed by the coordinate system defined by 
its master box. IntelligentBox allows us to make any box invisible. The child can 
access the connected slot of its parent by either a ’set’ message or a ’gimme’ 
message. A ’set’ message takes one parameter, while a ’gimme’ message has no 
parameter. The parent box can send an ’update’ message to its child boxes. This 
message takes no parameter. In their default definitions, a ’set’ message writes 
its parameter value into the corresponding slot register in the parent box, while 
a ’gimme’ message reads the value of this slot register. An ’update’ message tells 
the recipient that a state change has occurred in the parent box. In addition to 
these three standard messages, each box can accept geometrical messages such 
as ’resize’, ’move’, ’copy’, ’hide’, and ’show’. 
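
The slot and message protocol described in this section can be sketched as a small class. This is a toy model written for illustration, not the IntelligentBox implementation: the class names, the `connect` method, and the decision to send ’update’ to all children after every ’set’ are assumptions. The slot name `ratio` mirrors the #ratio slot of the RotationBox example.

```python
class Box:
    """Toy box: a dictionary of slots plus at most one parent connection."""

    def __init__(self, **slots):
        self.slots = dict(slots)
        self.parent = None
        self.parent_slot = None
        self.children = []

    def connect(self, parent, slot):
        # A box can be connected to a single slot of no more than one other box.
        if self.parent is not None:
            raise ValueError("box already has a parent")
        if slot not in parent.slots:
            raise KeyError(slot)
        self.parent, self.parent_slot = parent, slot
        parent.children.append(self)

    def set(self, slot, value):
        # Default 'set': write the value into the slot register, then notify
        # the children of the state change (assumption of this sketch).
        self.slots[slot] = value
        for child in self.children:
            child.update()

    def gimme(self, slot):
        # Default 'gimme': read the slot register.
        return self.slots[slot]

    def update(self):
        # 'update' takes no parameter; the default reaction is to do nothing.
        pass


class DisplayBox(Box):
    """Hypothetical child box that mirrors the connected slot of its parent."""

    def __init__(self):
        super().__init__()
        self.shown = None

    def update(self):
        self.shown = self.parent.gimme(self.parent_slot)


rotation = Box(ratio=0.0)
display = DisplayBox()
display.connect(rotation, "ratio")

rotation.set("ratio", 0.25)                    # change made on the parent
first = display.shown
display.parent.set(display.parent_slot, 0.5)   # child writes via a 'set' message
second = display.shown
```

Because the ’update’ message carries no parameter, the child has to issue a ’gimme’ to learn the new value, which is what `DisplayBox.update` does.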

3 Information Materialization through Query 
Composition 

Our framework provides visual interactive components as boxes for (1) access- 
ing databases, (2) specifying and modifying database queries, (3) defining an 
interactive 3D object as a template to materialize each record in a virtual space,
and (4) defining a virtual space and its coordinate system for the information
materialization.

Figure 1 shows an example composition for information materialization. It 
specifies the above mentioned four functions as a flow diagram from left to right. 
The leftmost box is a TableBox, which allows us to specify a database relation to 
access; it outputs an SQL query with the specified relation in its ’from’ clause, 
leaving its ’select’ and ’where’ clauses unspecified. The database is stored in 
a local or remote database server running an Oracle DBMS. When clicked, a
TableBox pops up the list of all the relations stored in the database, and allows 
us to select one of them. 

The second box is a TemplateManagerBox, which allows us to specify a composite
box used as a template to materialize each record. It allows us to register
more than one template, and to select one from those registered for record
materialization. When we select a template named t, the TemplateManagerBox
adds a virtual attribute, ’t’ as TEMPLATENAME, in the ’select’ clause of the
input query, and outputs the modified SQL query. The database has an additional
relation to store the registered templates. This relation TEMPLATEREL
has two attributes: TEMPLATENAME and TEMPLATEBOX. The second attribute
stores the template composite box specified by the first attribute. In
the later process, the specified SQL query is joined with the relation
TEMPLATEREL to obtain the template composite box from its name. When we
register a new template composite box, the TemplateManagerBox accesses the
database DDD to obtain all the attributes of the relation specified by the input
SQL query. It adds slots with these attributes to the base box of the template
composite box. In the later process, the record materialization assigns each
record value to a copy of this template box, which decomposes this record value
into its attribute values and stores them in the corresponding attribute slots of the
base box.

Fig. 1. An example composition for information materialization

The third component in the example is a RecordFilterBox, which allows us 
to specify an attribute attr, a comparison operator θ, and a value v. This specification
modifies the input query by adding a new condition attr θ v in its
’where’ clause. The RecordFilterBox accesses the database DDD to know all the 
accessible attributes. 

The last component in this example is a ContainerBox with four more compo- 
nents, an OriginBox, and three AxisBoxes. A ContainerBox accesses the database 
with its input query, and materializes each record with the template composite 
box. While an OriginBox specifies the origin of the coordinate system of the ma- 
terialization space, each AxisBox specifies one of the three coordinate axes, and 
allows us to associate this with one of the accessible attributes. It also normalizes
the values of the selected attribute. These two components also use query
modification methods to perform their functions.

In addition to the components used in the above example, the framework 
provides two more components, a JoinBox and an OverlayBox. A JoinBox ac- 
cepts two input SQL queries, and defines their relational join as its output query. 
It allows us to specify the join condition. An OverlayBox accepts more than one
query, and enables a ContainerBox to overlay the materialization of these input
queries. From the query modification point of view, it outputs the union of input 
queries with template specifications. 
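
The JoinBox and OverlayBox can also be read as query rewriters. The following
Python sketch is an illustration under stated assumptions, not the framework's
implementation: the function names, the aliases `lhs` and `rhs`, the toy tables,
and the template names are hypothetical. In particular, tagging each record
with a TEMPLATENAME column is only one possible reading of "the union of input
queries with template specifications".

```python
import sqlite3


def join_queries(left: str, right: str, condition: str) -> str:
    """Relational join of two input queries under a user-given condition."""
    return f"SELECT * FROM ({left}) lhs JOIN ({right}) rhs ON {condition}"


def overlay_queries(tagged_queries) -> str:
    """Union of (query, template_name) pairs; each record carries its template."""
    parts = []
    for query, template_name in tagged_queries:
        if not template_name.isidentifier():
            raise ValueError(f"unexpected template name: {template_name!r}")
        parts.append(f"SELECT q.*, '{template_name}' AS TEMPLATENAME FROM ({query}) q")
    return " UNION ALL ".join(parts)


# Hypothetical toy relations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell (cell_id INTEGER, stage INTEGER)")
conn.execute("CREATE TABLE expression (cell_id INTEGER, gene TEXT, intensity REAL)")
conn.executemany("INSERT INTO cell VALUES (?, ?)", [(1, 1), (2, 2)])
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", [(1, "g1", 0.5), (2, "g1", 0.9)])

joined = join_queries("SELECT * FROM cell", "SELECT * FROM expression",
                      "lhs.cell_id = rhs.cell_id")
joined_rows = conn.execute(joined).fetchall()

overlay = overlay_queries([
    ("SELECT cell_id FROM cell WHERE stage = 1", "SphereTemplate"),
    ("SELECT cell_id FROM cell WHERE stage = 2", "CubeTemplate"),
])
overlay_rows = conn.execute(overlay).fetchall()
```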

By using a ContainerBox together with an OriginBox and AxisBoxes as a 
template composite box, we can define a nested structure of information mate- 
rialization as shown in Figure 2. The displacement of the origin of each record 
materializing ContainerBox from the map plane indicates the annual production 
quantity of cabbage in the specified year at the corresponding prefecture, while 
each record materializing ContainerBox shows the cabbage production changes 
during the last 20 years. 




Fig. 2. A nested structure of information materialization. 




As an application of our information materialization framework, we have
been collaborating with Gene Science Institute of Japan to develop an
interactive animation interface to access a cDNA database for the cleavage of a
sea squirt egg from a single cell to 64 cells. The cDNA database stores, for
each cell and for each gene, the expression intensity of this gene in this cell.
Our system, which was first developed using our old information materialization
framework without query modification components, animates the cell division
process from a single cell to 64 cells (Figure 3). It has two buttons to step
the division process forward or backward. When you click an arbitrary cell, the
system graphically shows the expression intensity of each gene in an a priori
specified set. You may also arbitrarily pick three different genes to observe
their expression intensities in each cell. The expression intensities of these
three genes are associated with the intensities of the three colors RGB to
highlight each cell of the cleavage animation. The wire-frame cube that encloses
the whole egg performs this function. Keeping this highlighting function active,
you can step the cell-division animation forward or backward. The development of
this system took only several hours using geometrical models of cells that were
designed by other people. The cDNA database is stored in an Oracle DBMS, which
IntelligentBox accesses using Java JDBC. We have applied our new information
materialization framework to the same application. This extension enabled us to
dynamically construct the same functionality within 15 minutes without writing
any program code or any SQL queries.
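
The RGB highlighting of cells described above can be sketched as follows. This
is a hypothetical illustration: the function name, the linear scaling, and the
clipping to a maximum intensity are assumptions, since the text does not say how
intensities are normalized.

```python
def expression_to_rgb(intensities, max_intensity):
    """Map the expression intensities of three genes to an (R, G, B) triple."""
    if len(intensities) != 3:
        raise ValueError("exactly three genes are required")
    if max_intensity <= 0:
        raise ValueError("max_intensity must be positive")
    channels = []
    for value in intensities:
        # Clip to [0, max_intensity], then scale linearly to 0..255.
        clipped = min(max(value, 0.0), max_intensity)
        channels.append(round(255 * clipped / max_intensity))
    return tuple(channels)


# Made-up intensities for the three selected genes in one cell.
cell_color = expression_to_rgb((0.0, 2.0, 10.0), max_intensity=10.0)
```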




Fig. 3. Information materialization of the gene expression in the cleavage.



Abstract. Rapid growth of digital data collections is overwhelming the
capabilities of humans to comprehend them without aid. The extraction of
useful data from large raw data sets is something that humans do poorly.
Aggregation is a technique that extracts important aspects from groups of data,
thus reducing the amount that users have to deal with at one time and thereby
enabling them to discover patterns, outliers, gaps, and clusters. Previous
mechanisms for interactive exploration with aggregated data were either too
complex to use or too limited in scope. This paper proposes a new technique for
dynamic aggregation that can combine with dynamic queries to support most of
the tasks involved in data manipulation.



1. Introduction 

Current technologies have enabled massive collections of data. Newer and faster 
algorithms for data analysis are always in demand to harness the flood. If the amount 
of data can be reduced to a manageable size, then humans can find patterns that 
automated algorithms may have missed. Dynamic Queries (DQ) is an interactive 
technique for data exploration [1]. Users manipulate sliders to filter out data. Each 
slider corresponds to an attribute of the data. A requirement of dynamic queries is that 
the visualization must keep up with the user’s manipulation within 100 milliseconds. 
Since a large portion of the computer’s computation is spent on visualization, when 
the datasets grow, the time to complete drawing grows proportionately. Thus DQ isn’t 
suitable for dealing with large amounts of data. 

Large datasets pose two problems for interactive exploration. The first is how to
represent the elements on the screen fast enough. The second is whether the user
can even understand what is drawn on the screen. Visual occlusion is a problem in general for
visualization. If the user can’t see the data point, then the time spent drawing the item 
was wasted. This problem can be solved for small numbers of items. The commercial 
data analysis package, SpotFire (www.spotfire.com), randomly jitters the data points 
continuously, so that clusters that occupy the same point can be seen. With larger data 
sets, the occlusion problem grows even more pressing, due to the non-uniform nature 
of most data sets. The visual representation can deceive users by not showing clusters 
that exist in the data.

Aggregation is an effective way of managing large data sets. It summarizes groups
of similar data elements and can greatly reduce the number of glyphs that are shown 
on the screen. Because users can specify how to aggregate the data, the important 
aspects of the data will be preserved while the dataset size is reduced. Patterns that are 
hidden within millions of data points can emerge dramatically when aggregation 
reduces these into thousands of points. Fredriksen et al. [5] explored using aggregated 
data in conjunction with SpotFire, and demonstrated the uses of different kinds of 
aggregation with highway incident data. Hochheiser and Shneiderman [6] used 
aggregation to interactively explore web log data. In their study, the aggregation was 
done manually through SQL queries, though integration with an aggregation tool was 
suggested as a future direction. 



2. Related Work 

Putting aggregation and Dynamic Queries together in one interface is not a new idea. 
Goldstein et al. [2] proposed it in 1994. An interface mechanism called Aggregate 
Manager (AM) was combined with DQ, which produced a powerful combination 
(Figure 1). DQ is used to select a subset of the data set; this is transferred over to AM 
as an aggregate group. AM can then do aggregation on different aggregate groups, 
and pass the data back to DQ for display. This loop fills one of the gaps in
DQ: providing conjunctions of disjunctive groups. Using AM along with DQ provides
many possible combinations for data manipulation, which is powerful but can be hard 
for users to understand and fully control. 




Fig. 1. The workspaces of AM with DQ

Fig. 2. SolarPlot showing a histogram



An alternative approach to user-controlled aggregation is automatic aggregation. 
Chuah and Roth [3] used automatic aggregation in SolarPlot, a circular histogram 
(Figure 2). Elements are mapped to a pixel on the circumference of a circle; the height
of a spike that emanates from the pixel represents the number of data values that fall
within that pixel. This aggregation is intuitive and simple: the scale of the
aggregation depends on the diameter of the circle, and the aggregated value is easily
understood. However, SolarPlot encodes only one dimension of data in the visualization,
so correlations between fields are harder to find.




Fig. 3. Close up and zoomed out view of Aggregate Towers 



Rayson’s [4] Aggregate Towers provide another automatic aggregation interface.
Data points are displayed as cubes on a 3D plane. As the user zooms in and out, data
points are clustered based on their geospatial location (Figure 3). Stacks pointing out
of the plane represent the aggregate groups. The cubes still retain their original
color-coding. This automatic technique alleviates the 2D occlusion problem by moving it
into 3D. These stacks of data towers can occlude each other in 3D, but this is easily
remedied by allowing the user to freely rotate the view.



3. A Simple User Driven Aggregation Interface 

Automatic aggregation is useful as a way to reduce occlusion. However, having no 
user control makes automatic aggregation of limited use for general datasets. 
Goldstein’s AM is complex and hard to use because the user has complete control and 
no automation. Our system represents a middle of the road approach. 

SpotFire’s user interface was used as the starting point of our system. In SpotFire, a
scatter plot of two attributes of the data is at the center of the screen. Combo boxes at
the edges of the axes select the fields being plotted. A panel on the right side displays
DQ controls and detail on demand. The entire interface is in front of the user. Our system
has characteristics similar to SpotFire. The aggregation controls are located on the left
side so that DQ can be placed on the right side. The primary aggregation control is a
combo box that can be enabled or disabled (Figure 4). Specifying a group of data
manually is easy using DQ. However, creating many such groups can be time 
consuming and should be automated. The user only needs to select a field to group on, 
by using the "Group by" widget, and have the program sort out the groups. The 
default grouping algorithm used is equivalence grouping. For numerical data,
values are equivalent when they represent the same number, thus 4 and 4.0 are the same
and belong in the same group. For categorical/string data, a case-sensitive string
comparison is used to determine equivalence, thus "4" and "4.0" as strings are not the
same. Should the user require a different grouping criterion, clicking on the "..."
button to the right of the combo box will bring up an options dialog. Here, the user 
can choose which algorithm to use and to configure the algorithm to their liking. If 
the groups that are created are not specific enough for the user, they can be broken
down into subgroups. E.g., in the case of census data, we can group the entries based
on gender, then subgroup based on age brackets, creating meaningful groups that can
be used in aggregation. Subgroups are also controlled by checkable combo boxes. A
combo box labeled "Subgroup by" will appear under the "Group by" widget after the
user has selected a field to group by.
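
The equivalence grouping and subgrouping described above can be sketched in a
few lines of Python. This is a minimal sketch, not the system's code: the
function names and the census-like records are hypothetical, and the `numeric`
flag stands in for whatever field typing the interface applies.

```python
from collections import defaultdict


def equivalence_key(value, numeric):
    # Numerical fields compare by value, so 4 and 4.0 share a key.
    # Categorical/string fields compare the exact, case-sensitive text.
    return float(value) if numeric else str(value)


def group_by(records, field, numeric=False):
    """Partition records into groups of equivalent values of `field`."""
    groups = defaultdict(list)
    for record in records:
        groups[equivalence_key(record[field], numeric)].append(record)
    return dict(groups)


def subgroup_by(groups, field, numeric=False):
    """Break every existing group down by a second field."""
    return {key: group_by(members, field, numeric) for key, members in groups.items()}


# Made-up records for illustration only.
census = [
    {"gender": "F", "age_bracket": "20-29", "code": "4"},
    {"gender": "F", "age_bracket": "30-39", "code": "4.0"},
    {"gender": "M", "age_bracket": "20-29", "code": "4"},
]

as_numbers = group_by(census, "code", numeric=True)   # one group: 4 equals 4.0
as_strings = group_by(census, "code", numeric=False)  # two groups: "4" and "4.0"
nested = subgroup_by(group_by(census, "gender"), "age_bracket")
```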

Once the grouping computation is finished, the results are shown on the screen
with each dot now representing a particular group. The size of the dot is currently
coded to show the number of elements in that group. The secondary aggregation
controls are the aggregate method combo boxes. These are located below the vertical
axis field selector and to the left of the horizontal axis field selector. The user can
select different aggregation algorithms for each axis independently.



4. System Demonstration 

The dataset used was extracted from web logs. The data is taken from University of 
Maryland’s Computer Science web server. Only the requests that belonged to the 
HCIL section of the website (www.cs.umd.edu/hcil) were extracted. This is similar to 
the dataset that Hochheiser and Shneiderman [6] explored in their study. The data 
have the following five fields: 

• Client host 

• time: timestamp of request 

• url: the URL requested 

• return code: the server response code to request 

• bandwidth used: number of bytes transmitted for that request 

Web log data is very large and has only a few data fields. Traditional web analysis 
packages create tables of statistics and static graphs. The user merely feeds the data to
the program, and it is the program that decides what to report back to the user.
Hochheiser and Shneiderman argued in their paper that interactive star field
visualization, like SpotFire, is a valid way of analyzing web log data. However,
finding some of the interesting features involved preprocessing and aggregation.
Thus, using the same web data is a good test of the flexibility and power of our
simple aggregation interface.

Since the web data consists of individual client requests, one logical grouping 
would be to group by user. By viewing the size of the groups, one can detect 
abnormally large numbers of requests from a particular user. We find that the Google
spider is the most frequent visitor of HCIL. To find out how much bandwidth Google
consumed, we change the field we are viewing to “Bandwidth used” and set the
aggregator function to sum the field (Figure 4). We found that it isn’t Google, but
another crawler, EoExchange, that is using the most bandwidth. To view the access
patterns of the clients, we can subgroup based on the time of access. Figure 5 shows 
access patterns of users over days. The bandwidth hog EoExchange shows up in this 
graph as well, while Google’s accesses are well hidden and spread out across days. 
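
The exploration steps above (group by client, view the group size, sum the
bandwidth, then subgroup by day) can be mimicked on a toy log. The records, host
names, and dates below are made up for illustration and are not taken from the
HCIL logs; the helper `aggregate` is a hypothetical name.

```python
from collections import defaultdict


def aggregate(records, group_fields, value_field, func):
    """Group records by the given fields and apply `func` to each group's values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in group_fields)
        groups[key].append(record[value_field])
    return {key: func(values) for key, values in groups.items()}


# Made-up web log records.
log = [
    {"client": "crawler-a.example", "day": "2001-03-01", "bytes": 100},
    {"client": "crawler-a.example", "day": "2001-03-02", "bytes": 100},
    {"client": "crawler-a.example", "day": "2001-03-03", "bytes": 100},
    {"client": "crawler-b.example", "day": "2001-03-02", "bytes": 5000},
    {"client": "crawler-b.example", "day": "2001-03-02", "bytes": 5000},
    {"client": "user-1.example", "day": "2001-03-01", "bytes": 300},
]

requests_per_client = aggregate(log, ["client"], "bytes", len)   # group size
bytes_per_client = aggregate(log, ["client"], "bytes", sum)      # sum aggregator
bytes_per_client_day = aggregate(log, ["client", "day"], "bytes", sum)  # subgroup

most_frequent = max(requests_per_client, key=requests_per_client.get)
biggest_consumer = max(bytes_per_client, key=bytes_per_client.get)
```

In this toy log, as in the demonstration, the most frequent visitor and the
largest bandwidth consumer are different clients, which only becomes visible
once the aggregation function is switched from count to sum.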



5. Conclusion

We have developed a simple manual aggregation interface that we believe users can
understand and use effectively. However, due to the inherent complexity of the
aggregation concept, users should have in mind a specific question they would like
answered. Unlike DQ, in which users can explore and experiment with data,
aggregation should be thought of as the creation of a new dataset. This new dataset can
then be explored by DQ. A usability test should be conducted to determine how readily
users understand the interface, and which grouping and aggregation algorithms are
needed to give users a rich enough set of tools to answer more complex questions
than those considered in this paper.



Abstract. Electron velocity distributions obtained by direct spacecraft
observation in space are contaminated by photoelectrons. The photoelectrons
are generated by solar ultraviolet radiation, and are regarded
as artificial noise from the viewpoint of scientific research. We propose a
method for separating the photoelectron component from the ambient electron
component. Our method uses a multivariate normal mixture model, whose
parameters are determined via the Expectation-Maximization (EM) algorithm.
Initial parameters of the EM algorithm are computed through
the classification of the velocity space by a spherical surface of some
arbitrary radius.



1 Introduction 

The process of knowledge discovery begins with data acquisition and ends with
the identification of a new pattern in the data. Between the start and the goal,
the process includes various steps that we roughly call “data analysis.” In the
scheme proposed by Fayyad et al. [2], the process consists of (1) data selection,
(2) data preprocessing, (3) data transformation, (4) data mining (hypothesis
generation), and (5) hypothesis interpretation / evaluation. Creation and
development of computational strategies for such steps enable us to reduce the
time needed to achieve knowledge discovery, and are indispensable in dealing
with large databases. We have demonstrated that the multivariate normal mixture
model is an effective tool for characterizing an observation of three-dimensional
space plasma velocity distributions [6,7]. The normal distribution is called the
Maxwellian distribution in plasma physics. That is, when multiple peaks exist in
the observed velocity distribution, these peaks can be well represented by the
multiple Maxwellian distributions that compose the mixture model [4]. We applied
the mixture model to the ion velocity distribution and determined the parameters
of the model through the Expectation-Maximization (EM) algorithm [1,3].
This procedure is regarded as a step of “data mining” in the scheme of Fayyad
et al. [2].

In this paper, we show that a similar procedure can be applied to the
“preprocessing step” of the analysis of electron velocity distributions with minor
modification. Since a spacecraft in sunlight is irradiated by solar ultraviolet
radiation, photoelectrons are produced from the illuminated surface material. The
spacecraft is then charged to a positive potential relative to the ambient plasma,
which attracts the photoelectrons emitted from the surface. When the electron
measurement is carried out in such an environment, returned photoelectrons are
detected together with the ambient natural electrons that are the intended target
of the observation. Since the amount of those photoelectrons is comparable to or
even larger than that of ambient electrons, it is difficult to obtain the real
information from the data. This difficulty has prevented the progress of the
quantitative study of electron dynamics.

However, we found that a two-component Maxwellian mixture model can
represent the photoelectrons and the ambient electrons. While the algorithm used
is similar to that in the previous work [6], a new algorithm has been developed
for setting the initial parameters.

2 Data 

We used electron velocity distributions obtained by the Low Energy Particle
Energy-per-charge Analyzer (LEP-EA) on board the Geotail spacecraft. LEP-EA
measured three-dimensional velocity distributions by dividing the velocity
space into 32 classes for the magnitude of the velocity, 7 for elevation angles,
and 16 for azimuthal sectors (Figure 1).




Fig. 1. Classes for observation of an electron velocity distribution with LEP-EA (RAM 
B mode) 



We define a probability function of the observed electron velocity
$\mathbf{v}_{pqr}$ [m/s]:

$$
f(\mathbf{v}_{pqr}) =
\frac{f_0(\mathbf{v}_{pqr})\, d\mathbf{v}_{pqr}}
     {\sum_{p,q,r} f_0(\mathbf{v}_{pqr})\, d\mathbf{v}_{pqr}},
\tag{1}
$$

where $f_0(\mathbf{v}_{pqr})$ [s$^3$/m$^6$] is an observed electron velocity
distribution function, and $d\mathbf{v}_{pqr}$ is the class interval whose class
mark is $\mathbf{v}_{pqr}$. Subscripts $p$, $q$ and $r$ are indicators of the
magnitude of the velocity, elevation angle, and azimuthal sector, and they take
integers $p = 1, \ldots, 32$, $q = 1, \ldots, 7$, and $r = 1, \ldots, 16$.
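
The normalization in (1) can be sketched as follows. This is a minimal sketch
with made-up numbers: the dictionaries keyed by $(p, q, r)$ and the function
name are hypothetical, and only three of the 32 x 7 x 16 classes are filled in.

```python
def probability_function(f0, dv):
    """Normalize f0 * dv over all classes so that the result sums to one.

    f0: observed distribution function per class, keyed by (p, q, r)
    dv: class interval (velocity-space volume) per class, same keys
    """
    total = sum(f0[key] * dv[key] for key in f0)
    if total <= 0:
        raise ValueError("the observation contains no particles")
    return {key: f0[key] * dv[key] / total for key in f0}


# Made-up values for three classes, for illustration only.
f0 = {(1, 1, 1): 2.0e-15, (2, 1, 1): 1.0e-15, (3, 1, 1): 0.5e-15}
dv = {(1, 1, 1): 1.0e15, (2, 1, 1): 2.0e15, (3, 1, 1): 4.0e15}
f = probability_function(f0, dv)
```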



3 Method 



3.1 Multivariate Maxwellian Mixture Model and EM Algorithm 

We approximate the probability function (1) by the mixture model composed of
the sum of two multivariate Maxwellian distributions:

$$
f(\mathbf{v}_{pqr}) \simeq \sum_{i=\mathrm{ph},\mathrm{am}}
n_i\, g_i(\mathbf{v}_{pqr} \mid \mathbf{V}_i, \mathbf{T}_i),
\tag{2}
$$

where $n_i$ is the mixing proportion of the Maxwellians
($\sum_{i=\mathrm{ph},\mathrm{am}} n_i = 1$, $0 < n_i < 1$). Notations “ph” and
“am” mean photoelectron and ambient electron, respectively. Each Maxwellian is
written as



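$$
g_i(\mathbf{v}_{pqr} \mid \mathbf{V}_i, \mathbf{T}_i) =
\left(\frac{m_e}{2\pi}\right)^{3/2} \frac{1}{\sqrt{|\mathbf{T}_i|}}
\exp\left[-\frac{m_e}{2}
(\mathbf{v}_{pqr} - \mathbf{V}_i)^{T}\, \mathbf{T}_i^{-1}\,
(\mathbf{v}_{pqr} - \mathbf{V}_i)\right],
\tag{3}
$$

where $m_e$ [kg] is the electron mass, $\mathbf{V}_i$ [m/s] is the bulk velocity
vector and $\mathbf{T}_i$ [J] is the temperature matrix of the $i$-th Maxwellian.
The log-likelihood of this mixture model becomes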



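$$
l(\theta) = N \sum_{p,q,r} f(\mathbf{v}_{pqr})
\log \sum_{i=\mathrm{ph},\mathrm{am}}
n_i\, g_i(\mathbf{v}_{pqr} \mid \mathbf{V}_i, \mathbf{T}_i),
\tag{4}
$$

where $\theta = (n_{\mathrm{ph}}, \mathbf{V}_{\mathrm{ph}},
\mathbf{V}_{\mathrm{am}}, \mathbf{T}_{\mathrm{ph}}, \mathbf{T}_{\mathrm{am}})$
denotes all the unknown parameters, and $N$ is the total particle count.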

Partially differentiating (4) with respect to $\mathbf{V}_i$ and
$\mathbf{T}_i^{-1}$ ($i$ = ph, am) and setting the derivatives equal to zero, we
obtain the equations that should be satisfied by the maximum likelihood
estimators of the parameters. Utilizing these equations, we estimate the unknown
parameters through the iteration of the EM algorithm, regarding the posterior
probabilities as unmeasured data [5,6]. We finish the iteration when the
log-likelihood and the unknown parameters no longer change between iterations.
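
To make the iteration concrete, the following Python sketch runs EM for a
two-component mixture on binned, weighted data. It is a one-dimensional
simplification written for illustration and is not the authors' implementation:
scalar variances stand in for $\mathbf{T}_i / m_e$, the function names and the
synthetic grid are assumptions, and convergence is tested on the log-likelihood
only.

```python
import math


def normal_pdf(v, mean, var):
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_two_component(values, weights, init, max_iter=500, tol=1e-12):
    """values: class marks; weights: probabilities f(v) that sum to one;
    init: list of (mixing proportion, mean, variance), one tuple per component."""
    params = list(init)
    history = []  # weighted log-likelihood per iteration (without the factor N)
    for _ in range(max_iter):
        # E-step: posterior probability of each component for each class.
        resp = []
        loglik = 0.0
        for v, w in zip(values, weights):
            dens = [n * normal_pdf(v, mean, var) for n, mean, var in params]
            total = sum(dens)
            resp.append([d / total for d in dens])
            loglik += w * math.log(total)
        history.append(loglik)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        # M-step: weighted moments, using the posteriors as unmeasured data.
        new_params = []
        for i in range(len(params)):
            n_i = sum(w * r[i] for w, r in zip(weights, resp))
            mean = sum(w * r[i] * v for v, w, r in zip(values, weights, resp)) / n_i
            var = sum(w * r[i] * (v - mean) ** 2
                      for v, w, r in zip(values, weights, resp)) / n_i
            new_params.append((n_i, mean, var))
        params = new_params
    return params, history


# Synthetic test data: a narrow and a broad component on a regular grid.
values = [float(v) for v in range(-40, 41)]
raw = [0.7 * normal_pdf(v, 0.0, 4.0) + 0.3 * normal_pdf(v, 5.0, 100.0) for v in values]
weights = [x / sum(raw) for x in raw]

params, history = em_two_component(values, weights,
                                   init=[(0.5, 0.0, 1.0), (0.5, 0.0, 100.0)])
```

The two means here are close to each other relative to the broad component's
width, which is the situation the paper is concerned with.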



3.2 Initial Parameters of EM Algorithm 

To reduce the number of iterations of the EM algorithm, a proper setting for the
initial parameters is desirable. We used the k-means algorithm for setting the
initial parameters for the iteration of the EM algorithm in the previous work [6].
However, since the k-means algorithm is a clustering algorithm for an exclusive
division, it is not applicable in setting initial parameters for a mixture model
whose bulk velocities are close to each other.

Now we approximate an observation of the electron distribution by photoelectrons
and ambient electrons. The parameters are expected to satisfy

$$ n_{\mathrm{ph}} \gtrsim n_{\mathrm{am}}, \tag{5} $$

$$ |\mathbf{V}_{\mathrm{ph}}| \simeq 0, \tag{6} $$

$$
|\mathbf{V}_{\mathrm{am}}| <
\sqrt{2\,\mathrm{tr}\,\mathbf{T}_{\mathrm{ph}} / 3 m_e} <
\sqrt{2\,\mathrm{tr}\,\mathbf{T}_{\mathrm{am}} / 3 m_e}.
\tag{7}
$$

This is the case when the k-means algorithm does not work well. We then adopt
the following method suitable for such a distribution.

1. Divide the 32 classes of the magnitude of velocity (radial direction in
the velocity space) into two groups by a certain boundary of radius $R$
($R = 2, 3, \ldots, 31$).

2. Compute mixing proportions, bulk velocity vectors, and temperature matrices
for both groups by the usual moment calculation procedure.

3. Set these values as the initial parameters of the EM algorithm.
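
The three steps above can be sketched as a moment calculation over the two
groups. The following Python code is an illustration under assumptions: the
function names and the four toy classes are hypothetical, the speed class index
$p$ is assumed to increase with speed, and classes with $p < R$ are assigned to
the inner (photoelectron) group, a convention the text does not specify.

```python
ELECTRON_MASS = 9.1093837015e-31  # [kg]


def moments(samples):
    """samples: list of (velocity (vx, vy, vz) [m/s], probability weight)."""
    n = sum(w for _, w in samples)
    bulk = [sum(w * v[j] for v, w in samples) / n for j in range(3)]
    # Temperature matrix [J]: electron mass times the velocity covariance.
    temperature = [
        [ELECTRON_MASS * sum(w * (v[j] - bulk[j]) * (v[k] - bulk[k])
                             for v, w in samples) / n
         for k in range(3)]
        for j in range(3)
    ]
    return n, bulk, temperature


def initial_parameters(classes, boundary):
    """classes: list of (p, velocity, weight); boundary: the radius index R."""
    inner = [(v, w) for p, v, w in classes if p < boundary]
    outer = [(v, w) for p, v, w in classes if p >= boundary]
    return {"ph": moments(inner), "am": moments(outer)}


# Made-up classes: slow particles in class 1, fast particles in class 2.
classes = [
    (1, (1.0e5, 0.0, 0.0), 0.3),
    (1, (-1.0e5, 0.0, 0.0), 0.3),
    (2, (1.0e6, 0.0, 0.0), 0.2),
    (2, (-1.0e6, 0.0, 0.0), 0.2),
]
initial = initial_parameters(classes, boundary=2)
```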

4 Application 

The top panel of Figure 2 shows an observation of the electron velocity
distribution which was obtained in the time interval 1420:00-1420:12 on January
16, 1994. The two lines displayed are densities in the velocity space along the
$v_x$ and $v_y$ axes. We find high density around the origin ($v_x = v_y = 0$),
which corresponds to the photoelectrons. When applying the two-component
Maxwellian mixture model to the data, we obtained the photoelectron component and
the ambient electron component separately as shown in the bottom-left and
bottom-central panels of Figure 2, respectively. The estimated parameters are
given in Table 1. The sum of the two components is also shown in the bottom-right
panel.



Table 1. Estimated parameters for the two-Maxwellian mixture model in the time
interval 1420:00-1420:12 on January 16, 1994. The value of n is the mixing proportion
multiplied by the total number density

Electron   n [/cc]   Vx [km/s]   Vy     Vz     Txx [eV]   Txy   Txz   Tyy   Tyz   Tzz
photo      3.819     -534        257    123    7          0     0     7     0     7
ambient    0.064     270         -100   -87    144        -4    1     131   1     122


In setting the initial parameters of the EM algorithm, it may matter how we
select the cutting radius $R$, which provisionally classifies the data into
photoelectron and ambient electron components. We found, however, that the
results after the iteration of the EM algorithm are the same for most choices of
$R$. In this example, we obtain the same result for $R$ = 2 to 25.
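
As a worked check, the expectation (7) can be evaluated with the estimated
values of Table 1. The sketch below assumes the table is read with bulk velocity
components in km/s and diagonal temperature components Txx, Tyy, Tzz in eV; the
function names are hypothetical and the physical constants are standard values.

```python
import math

ELECTRON_VOLT = 1.602176634e-19   # [J]
ELECTRON_MASS = 9.1093837015e-31  # [kg]

# Values read from Table 1 (V in km/s, diagonal of T in eV).
photo = {"V": (-534.0, 257.0, 123.0), "T_diag": (7.0, 7.0, 7.0)}
ambient = {"V": (270.0, -100.0, -87.0), "T_diag": (144.0, 131.0, 122.0)}


def thermal_speed_km_s(t_diag_ev):
    """sqrt(2 tr T / 3 m_e), with tr T converted from eV to joules."""
    trace_joule = sum(t_diag_ev) * ELECTRON_VOLT
    return math.sqrt(2.0 * trace_joule / (3.0 * ELECTRON_MASS)) / 1.0e3


def magnitude(vector):
    return math.sqrt(sum(component ** 2 for component in vector))


bulk_am = magnitude(ambient["V"])                  # about 3.0e2 km/s
thermal_ph = thermal_speed_km_s(photo["T_diag"])   # about 1.6e3 km/s
thermal_am = thermal_speed_km_s(ambient["T_diag"]) # about 6.8e3 km/s
condition_7_holds = bulk_am < thermal_ph < thermal_am
```

The ambient bulk speed is well below both thermal speeds, so the two components
overlap strongly in velocity space, which is why an exclusive division is not a
suitable starting point.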

[Figure 2: GEOTAIL LEP-EA-e RAMB, 1420:00-1420:12 UT on January 16, 1994.
Panels: Observation (top); Photoelectron, Ambient Electron, Photo + Ambient
(bottom). Horizontal axes: Vx, Vy [km/s], from -10000 to 10000.]

Fig. 2. Observation of electron velocity distribution along the $v_x$ and $v_y$
axes between 1420:00-1420:12 UT on January 16, 1994 (top). Bottom panels are
estimated components for photoelectron (left), ambient electron (center), and the
sum of both (right). Two vertical broken lines in the top panel indicate electron
velocities equivalent to the spacecraft potential at this time interval



5 Discussion 



Traditionally, the spacecraft potential was utilized to decompose the electron
data into a photoelectron component and an ambient electron component. When the
spacecraft potential is $\phi$ [V], the equivalent electron speed is
$v_e = \sqrt{2e\phi/m_e}$ [m/s], where $e$ [C] is the elementary electric charge.
This means that a photoelectron whose speed is less than $v_e$ cannot escape from
the spacecraft and is pulled back to the spacecraft. Therefore, the density of
particles slower than $v_e$ would be contributed by photoelectrons as well as
ambient electrons. On the assumption that the photoelectrons were distributed
within $|\mathbf{v}| < v_e$, the traditional approach did not use the measured
density of the slow particles and instead interpolated it from the density of
particles faster than $v_e$. However, the equivalent electron speed $v_e$ is not
so accurate as an indicator of the photoelectron distribution. In the same time
interval (1420:00-1420:12 UT on January 16, 1994), the spacecraft potential is
$\phi = 26.09$ V and hence $v_e = 3029$ km/s, which is shown in the top panel of
Figure 2 as $\pm v_e$ by two vertical broken lines. The two lines are located at
smaller velocities than our expectation ($v_e \sim 4000$ km/s), and will give an
inappropriate interpolation.
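
The quoted value of the equivalent electron speed can be reproduced directly
from $v_e = \sqrt{2e\phi/m_e}$. The short Python sketch below uses standard
values for $e$ and $m_e$; the function name is hypothetical.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # [C]
ELECTRON_MASS = 9.1093837015e-31     # [kg]


def equivalent_electron_speed(potential_volts):
    """Speed [m/s] of an electron whose kinetic energy equals e * potential."""
    return math.sqrt(2.0 * ELEMENTARY_CHARGE * potential_volts / ELECTRON_MASS)


# Spacecraft potential for 1420:00-1420:12 UT on January 16, 1994.
speed_km_s = equivalent_electron_speed(26.09) / 1.0e3
print(f"{speed_km_s:.0f} km/s")
```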

Since our method works automatically with less computational burden, it
can compute the macroscopic quantities of the ambient electrons
($n_{\mathrm{am}}$, $\mathbf{V}_{\mathrm{am}}$, and $\mathbf{T}_{\mathrm{am}}$)
on board the spacecraft. This will be useful given the limited transmission
resources from the spacecraft due to telemetry constraints.



Abstract. This paper reports on a logical model of the rational use of theory in
a particular discovery problem in neuropharmacology, based on a case study of
a practice of drug research for Parkinson’s disease. This analysis describes how
qualitative assumptions about the relation between properties of the nervous
system are used to search for drug leads, i.e. properties for possible drug
interventions. The logical structure of this drug lead discovery problem is
briefly described together with the structure of some assumptions from the case
study. It is briefly discussed how computational tools were used to explore
these assumptions, and how they could possibly aid discovery in this domain.



1 Introduction 

The study of scientific discovery is a subject that has a long tradition in philosophy of 
science and logic. The questions and methods in philosophy of science and logic 
usually focus on fundamental and normative matters that are often abstract and seem 
to lie far away from scientific practice. In one of the efforts of our department to 
bridge this gap I extensively studied a practice of neuropharmacology, in particular 
the drug research for Parkinson’s disease conducted at the Groningen University 
Center for Pharmacy. Based on this study, where I interviewed and followed 
practitioners during their experiments, I modeled the logical structure of used theories 
and assumptions, and different types of problems that led to conceptual and empirical 
discoveries, cf. [1]. In this paper I first briefly describe the logical structure of rational 
drug lead discovery, one of the types of discovery problems I encountered. Secondly, 
I illustrate this problem with an example from the case study. I end with a discussion. 



2 Rational Drug Lead Discovery 

A main goal of drug research is to discover and design drugs and drug treatments. In 
the rational search for a drug treatment knowledge of biological processes is used to 
infer the effect of a drug intervention. A suggested intervention can either contain a 
description of a desired local influence of a drug on a biological system, or a 
description of a drug that is known to have the needed functional properties. These 
desired properties of a drug should cause a decrease in disease symptoms, and are
called a drug lead, cf. [2]. The rational search for a drug lead can be described as a
problem of qualitative reasoning. Knowledge of qualitative relations between 
variables describing properties of a pathological biological system can be sufficient to 
find variables that can influence that system in a desired way, cf. [3]. 

The search involved is structurally similar to that of explanatory or abductive 
reasoning, but has a different search goal. Instead of finding a simple hypothesis that 
explains an observed behavior, the task is to find a minimal intervention that has a 
desired effect on properties such as the behavior of the system, with minimal side 
effects. So, analogously to inference to the best explanation, this process can be called 
inference to the best intervention. 

The object of drug treatment design does not initially concern the properties of a 
compound as in drug design, but the properties of a biological system, an organism. In 
the latter the goal is to create a drug so that it has given desired properties, in the 
former the goal is to create the behavior of a biological system so that it has given 
desired properties. These properties can be divided in structural and functional 
properties. A disease can be represented as a set of unwanted properties of a 
biological system. These can be compared with wished for properties of a system. 

So we can define the characteristics of a disease as follows. Given the operational
properties $O(x)$ of a pathological system $x$ and the wished for properties $W$, the
characteristics of a disease y can be defined as $W \,\Delta\, O(x)$, the symmetric
difference between $O(x)$ and $W$:

$$ W \,\Delta\, O(x) := (W - O(x)) \cup (O(x) - W) $$

The set $O(x)$ contains all the considered properties of a system $x$, not only the
pathological properties. So the set $W \cap O(x)$ is not empty. The goal of drug
treatment is to change the properties $O(x)$ of system $x$ to $O^*(x)$ such that both
$O^*(x) - W$ and $W - O^*(x)$ are minimized.
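
With properties represented as set elements, the definition above maps directly
onto set operations. The following Python sketch uses abstract, made-up property
labels; the function name is hypothetical.

```python
def disease_characteristics(wished, operational):
    """Symmetric difference: wished-for properties that are absent,
    together with operational properties that are not wished for."""
    return (wished - operational) | (operational - wished)


# Abstract property labels, for illustration only.
W = {"p1", "p2", "p3"}          # wished-for properties
O_x = {"p2", "p3", "p4"}        # operational properties of system x

characteristics = disease_characteristics(W, O_x)
shared = W & O_x                # non-empty: not all properties are pathological
```

Here `characteristics` contains `p1` (wished for but missing) and `p4` (present
but unwanted), and it coincides with Python's built-in symmetric difference
`W ^ O_x`.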

Rational drug treatment discovery involves finding a drug treatment for a given 
pathological condition of a system by maximally employing known theories and 
knowledge about biological processes. A proper theory about a disease should be able 
to imply the pathological properties. 

So, let a set $H$ of theories about biological processes be given as well as
background assumptions $B(x)$ involved in the explanation of the observed properties
among the properties $O(x)$ of a pathological system $x$. The problem of the design of
a drug treatment of the pathological properties $O(x) \,\Delta\, W$ is to cause only
wished for properties from $W$ by a drug intervention $I(x)$ of the system, i.e.
$H \cup B(x) \models I(x) \rightarrow W$. If a theory can imply the pathological
condition, then we can use that knowledge to search for a suitable intervention, see
Table 1.

Table 1. Logical structure of the rational drug lead discovery problem

Start : $H \cup B(x) \models O(x)$

Goal : $H \cup B(x) \models I? \rightarrow W$

Result : $I^*(x)$

The search goal is to find, by reasoning about processes represented in $H$, a proper
drug intervention that influences processes that cause the desired properties $W$, but
not those from $O(x) - W$. That is, the goal is to eliminate the difference between $W$
and $O(x)$. The result of the search is the suggestion of a manipulation of a local
biochemical property that can be affected by a drug. A drug that has this wished for
functional effect can be searched for in the set of known drugs, or pose a new problem
for rational drug design.

Of course it would be ideal, given the known $H$ and the nature of the disease, to
infer a suggestion for a drug intervention $I$ that only causes $W$. A drug usually also
causes side effects, often creating undesired effects that are not part of the disease
that is targeted. Therefore we need a gradual evaluation criterion for the improvement
of suggestions, cf. [4]. Let us say that the moderated design goal is to find the
suggestion $I(x)$ such that its (predicted) consequence for a system,
$H \cup B(x) \models I(x) \rightarrow P(x)$, resembles the desired condition $W$ more
than the pathological condition $O(x)$, i.e. that $P(x) \,\Delta\, W$ is a proper
subset of $O(x) \,\Delta\, W$. That is, the drug should not have more unwanted
consequences than accomplished desired consequences, cf. Fig. 1.




Fig. 1. Problem state in searching an intervention with effect $P(x)$, in a space of
relevant properties RP, that most resembles desired properties $W$ in treating a
pathological system $x$ with operational properties $O(x)$

The evaluation of improvement of more than one drug suggestion can follow the 
same lines. A drug intervention I* of x is better than an intervention I if the properties 
of its consequence P* resemble W more than those of P, i.e. P*(x) Δ W is a proper 
subset of P(x) Δ W. 
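
The comparison can be illustrated with a minimal sketch, assuming that properties are modeled as plain sets of labels and that the criterion is read as a comparison of symmetric differences with W. The property labels and the helper names `deviation` and `is_better` are hypothetical and serve only as an illustration.

```python
def deviation(properties, desired):
    """Properties on which a state differs from the desired set W (symmetric difference)."""
    return set(properties) ^ set(desired)


def is_better(candidate, reference, desired):
    """True if the candidate's deviation from W is a proper subset of the reference's."""
    return deviation(candidate, desired) < deviation(reference, desired)


# Hypothetical property labels, for illustration only.
W = {"normal_movement", "normal_mood"}                  # desired properties
O = {"tremor", "rigidity", "normal_mood"}               # pathological state O(x)
P = {"tremor", "normal_movement", "normal_mood"}        # predicted effect of I
P_star = {"normal_movement", "normal_mood"}             # predicted effect of I*

print(is_better(P, O, W))        # does I improve on the untreated state?
print(is_better(P_star, P, W))   # does I* improve on I?
```

Because the test is a proper-subset test, an intervention that removes symptoms but introduces a new unwanted property is not counted as an improvement under this reading.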

However, this is only an evaluation of properties that is neutral to the different 
kinds of undesired properties. In this way an intervention could be inferred that treats 
most of the symptoms, but causes a symptom that is worse than the disease that is 
treated. This could be remedied by an ordering of the undesired properties, together 
with a measure of deviation. 

The resulting suggestion for a drug intervention can in turn be used to test the 
theories used to find the suggestion. Given an inferred drug intervention I(x), an 
experiment can be done and its resulting observation of the altered operational 
properties O(x) of x can be compared with the predicted properties P(x). A 
discrepancy can be used to redesign H, or the assumptions about B(x) or I(x). 


3 Case Study: Drug Research for Parkinson’s Disease 

In Parkinson’s disease patients suffer from motor behavior impairment. The cause of 
this disease is traced back to degeneration of a particular brain area called the 
substantia nigra pars compacta (SNC) that supplies the neurotransmitter dopamine 
to a brain area that is called the basal ganglia. Dopamine regulates the activation of 
the brain area called the substantia nigra reticulata (SNR), by exciting dopamine 
receptors of type D1 and inhibiting receptors of type D2. When the amount of 
dopamine depletes in Parkinson’s disease, this balance is disrupted, resulting in a 
pathological increase of the activation of the SNR. The drug L-dopa, a precursor in 
the metabolism of dopamine, treats the disease. However, since this intervention acts 
on all dopamine receptors in the body it causes undesired side-effects. Currently 
selective drugs are searched for that only target a relevant subtype of the dopamine 
receptor in the basal ganglia. A problem for finding a treatment is to discover what 
subtypes to target. This search problem is logically reconstructed. 

It is assumed that if a theory explains a proposition, then that theory should also be 
able to logically imply that proposition. A qualitative model of the basal ganglia was 
constructed that can logically imply the consequences of decreasing the amount of 
dopamine. Knowledge about the basal ganglia can be represented as a qualitative 
theory about a dynamical system, defined as a tuple (V, Q, C), where V is a set of 
variables which are reasonable functions over time, Q is a set of quantity spaces for 
those variables, and C is a set of constraints on variables in V. For the basal ganglia 
theory I used two basic variables describing the firing rate (f) of nerve cells in a cell 
group, nucleus or pathway, and the amount (a) of a particular neurotransmitter released 
in the vicinity of a cell group, nucleus or neural pathway. The constraint relation y = 
M±(x) is used to state that the change of values of y over time is positively/negatively 
monotonically related to the change of value of x. 

Fig. 2 displays a fragment of the basal ganglia model consisting of the 
cell nuclei called striatum, SNC, GPe, SNR, and the neurochemicals L-Dopa, 
dopamine, GABA, and glutamate; for further details see [1] and [5]. For example, the 
increase of the firing rate of the SNC causes an increase in the amount of dopamine in 
the striatum, while this latter increase causes a decrease in activation of the neural 
pathway that signals to the GPe, propagating further through the neural circuitry. 
Given the model it can be deduced that a decrease of the amount of dopamine implies 
an increase of the firing frequency of the SNR, i.e. 

H_BG ∪ B: {a(DA, striatum) = dec} ⊨ P: {f(SNR) = inc} 
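
Such a deduction can be sketched as the propagation of directions of change along monotonic constraints. The following is a minimal illustration, not the model of Fig. 2: the three constraints are a heavily simplified, hypothetical chain assembled from the relations mentioned in the text, and the function names are assumptions. Conflicting influences on one variable, which a full qualitative simulator has to resolve, are ignored here.

```python
# Qualitative directions of change.
INC, DEC, STD = "inc", "dec", "std"


def flip(direction):
    """Reverse a direction of change; a steady value stays steady."""
    return {INC: DEC, DEC: INC, STD: STD}[direction]


# Hypothetical, simplified fragment: (source, target, sign) states that the
# target is monotonically increasing (+1) or decreasing (-1) in the source.
CONSTRAINTS = [
    ("f(SNC)", "a(DA, striatum)", +1),
    ("a(DA, striatum)", "f(indirect pathway)", -1),
    ("f(indirect pathway)", "f(SNR)", +1),
]


def propagate(initial, constraints):
    """Propagate directions of change forward until nothing new is derived."""
    state = dict(initial)
    changed = True
    while changed:
        changed = False
        for source, target, sign in constraints:
            if source in state and target not in state:
                state[target] = state[source] if sign > 0 else flip(state[source])
                changed = True
    return state


result = propagate({"a(DA, striatum)": DEC}, CONSTRAINTS)
print(result["f(SNR)"])
```

Starting from a decrease of dopamine in the striatum, the sketch derives an increase of the firing rate of the SNR, in line with the implication stated above.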

The use of qualitative models of biological processes could help to explain and 
find suggestions for possible treatments. The discovery problem for a treatment can 
be defined as a search in a solution space of conceptually possible interventions. We 
start with a qualitative model and known initial values of its variables. A goal of 
desired variable values is set. Reasoning backward from the goal values one can 
explore possible manipulations of the variables. The approximation criterion, as 
defined in the former section, can then be used to measure the difference between the 
goal values and the values caused by a particular manipulation, implementing a 
means-end analysis. 

Fig. 2. A part of H_BG, a qualitative model of some assumptions about the basal ganglia 

In Parkinson’s disease, the goal set includes a lower activation frequency of the 
SNR than in the pathological case. A search through possible manipulations will not 
only find an increase of the amount of L-dopa in the striatum. It will also find that a 
decrease of the firing rate of the indirect pathway between the striatum and the GPe 
results in a decrease of the firing rate of the SNR. Administering a selective D2 
agonist can cause such a decrease, with a lesser effect on dopaminergic pathways in 
other parts of the body than the effect of L-Dopa. These kinds of suggestions were in 
turn used to empirically test the basal ganglia model, cf. [1]. 



4 Discussion 

This logical reconstruction tells us nothing new about what to do about Parkinson’s 
disease. Yet by making the knowledge and reasoning explicit (by describing it 
formally) it is possible to increase the complexity of models that are used in practice, 
such as that of the basal ganglia. Via a computer program as a modeling tool it is 
possible to keep track of, and further investigate, all the consequences of such a 
model. The conceptual space of the basal ganglia model was explored using the 
qualitative simulator QSIM, cf. [6], and GARP, a general architecture for reasoning 
about physics, cf. [7], yet neither implements abductive discovery operators. An 
implementation of abduction and inference to the best intervention in the context of 
this domain is part of our ongoing research. 

However, the bigger problem in making such tools useful in practice is the 
availability of biological theory in a formal representation. It would be ideal if 
scientists in biology and medicine published their results both in natural language 
and in a formal format. It would already provide a much clearer view of results if it 
were stated qualitatively whether investigated parameters were found to be 
positively or negatively related. Different publications about a domain taken together 
would provide a search space that could be relatively manageable, leaving the details 
of testing interesting discovered hypotheses to further empirical research based on 
such suggestions. 



Abstract. We present a record extractor system, SCOOP. We assume 
that the semi-structured documents given to SCOOP have similar formats 
and that each of them contains only one record consisting of several different 
fields. SCOOP treats a document as just a string and uses no knowledge 
about the input except that a field is surrounded by delimiters, a left delimiter 
ends with “>”, and the corresponding right delimiter begins with “<”. 
By counting substrings, SCOOP roughly divides each document into two parts: 
contents of the fields and others. SCOOP then counts substrings near the 
boundaries of the two parts and extracts the most frequent substrings as 
delimiters. We show experimental results with news articles written in English 
or Japanese. In this experiment, a record consists of the headline and the body 
text. SCOOP extracts records at a high rate. 



1 Introduction 

The number of Web pages is increasing rapidly. These pages contain useful 
data. Nevertheless, the structure of the data is described by patterns of 
strings and is not explicit, unlike data in database systems. That is why 
we call them semi-structured documents [1]. A major target of Web mining is a 
set of semi-structured documents. 

An important application of Web mining is the extraction of the contents of 
semi-structured documents as records. A record is a basic notion of database 
systems. A database consists of records, and a record consists of some fields. 
A field is the minimum unit in a database. To use Web pages like a database 
system, one needs wrappers that extract the contents of the pages. In this paper, 
we describe a system that extracts the contents of Web pages as records without 
any knowledge of the input documents. 

An input for our system is a set of semi-structured documents which have 
similar formats, such as Web pages on the same site or pages generated 
automatically by a search facility. There are two types of record extraction 
problem, depending on the number of records contained in an input file [2]. 
In the single instance problem, the input is a file which contains many instances 
of the same record. In the multiple instance problem, the input is a set of files 
each of which contains an instance of the same record. We consider the multiple 
instance problem. 

The format of Web pages is usually described with some patterns of HTML 
tags. The record structure of Web pages is written in such a way that each 
field is enclosed by a pair of tag sequences acting as the left parenthesis and 
the right parenthesis. In [3], Atzeni and Mecca implemented a language in which 
one can describe a wrapper by specifying these tag patterns. In [6,7], Kushmerick, 
Weld and Doorenbos applied machine learning techniques to the extraction of 
these tag sequences. In [8], Sakamoto, Arimura, and Arikawa proposed a wrapper 
that pays attention to the tree structure of HTML documents. Most of these 
approaches require some instances of records for learning or use some knowledge 
such as the types of tags used. For example, the wrapper in [7] has to know the 
position of records. In [4], Embley, Jiang and Ng showed some boundary detection 
techniques, but they assumed that the boundaries are determined by the tags hr, 
td, tr, a, table, p, br, h4, h1, strong, b and i. 

We do not use such instances or knowledge. We treat a document as just 
a string. The only knowledge about the input is that a left delimiter ends with 
“>” and the corresponding right delimiter begins with “<”. Most of the field 
boundaries found experimentally are substrings of tag sequences such as 
“t><b>” and “</b><hr><”. 

We assume the following five heuristics to guess the structure of HTML files: 
frequent strings are not the contents of fields [5], each field is surrounded by 
two substrings of tag sequences, instances of the same field are surrounded by 
the same pair of delimiters, each file contains an instance of the same record, 
and a left delimiter ends with “>” and the corresponding right delimiter begins 
with “<”. 

SCOOP has three steps: (1) SCOOP uses FindOptimal, developed in [5], to 
roughly divide each document into two parts by counting substrings: contents of 
the fields and others. FindOptimal assumes that frequent substrings express the 
structure of documents and are not the contents of fields. (2) SCOOP counts 
substrings near the boundaries of the two parts and extracts the most frequent 
substrings as delimiters. (3) SCOOP uses the delimiters to extract fields. This is 
based on the very simple idea of counting frequent substrings twice, but SCOOP 
extracts records at a high rate when given news articles written in English or 
Japanese. 



2 SCOOP System 

SCOOP is a system which extracts records from semi-structured documents with 
similar formats and outputs a list of records (see Fig. 1). A pair of delimiters 
surrounding each field is called a rule. A left delimiter is called a start_string 
and the corresponding right delimiter is called an end_string. 

In the Preprocessing step of Fig. 1, SCOOP utilizes the algorithm FindOptimal 
developed in [5]. FindOptimal also receives a set of semi-structured documents 
with similar formats and roughly divides them into two parts: contents of the 
fields and others. FindOptimal also treats input documents as just strings. 

Y. Yamada, D. Ikeda, and S. Hirokawa 

Fig. 1. The outline of SCOOP system 

FindOptimal outputs a pair (n, a), where n denotes a length and 0 < a < 
100 denotes a percentage. Consider all substrings of length n of the input 
documents, sorted by the number of their occurrences in decreasing order. 
If a substring of length n is in the top a-percent of the sorted list, then we say 
that the substring is frequent on (n, a). We color the frequent substrings of each 
input string gray; for instance, if accbaacbc is a part of an input string and ba 
and cb are frequent on some pair, the occurrences of ba and cb in it are gray. 
FindOptimal assumes that frequent substrings express the structure of documents 
and are not the contents of fields. So black substrings cover the contents of the 
fields and gray substrings cover the rest. 
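
The notion of being frequent on (n, a) can be sketched as follows. This is a minimal illustration, not the implementation of FindOptimal from [5]: the function names are hypothetical, the percentage is assumed to be taken over the list of distinct substrings, and ties in the ranking are broken arbitrarily.

```python
from collections import Counter


def frequent_substrings(documents, n, percent):
    """Substrings of length n in the top `percent` percent of the ranked list."""
    counts = Counter(
        doc[i:i + n] for doc in documents for i in range(len(doc) - n + 1)
    )
    ranked = [s for s, _ in counts.most_common()]  # decreasing number of occurrences
    keep = max(1, len(ranked) * percent // 100) if ranked else 0
    return set(ranked[:keep])


def gray_mask(doc, frequent, n):
    """True at every character position covered by some frequent substring (gray)."""
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if doc[i:i + n] in frequent:
            for j in range(i, i + n):
                mask[j] = True
    return mask


# The fragment from the text, with ba and cb assumed to be frequent.
mask = gray_mask("accbaacbc", {"ba", "cb"}, 2)
print("".join("g" if gray else "b" for gray in mask))
```

In the printed string, g marks gray positions and b marks black positions, i.e. candidate field contents.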

FindOptimal finds a pair (n, a) which attains a local minimum of the alternation 
count. The alternation count is the number of alternations between black and gray 
substrings. In [5], it is experimentally shown that, given news articles written in 
English or Japanese, FindOptimal separates the contents of the fields from the 
rest with more than 95% accuracy. 
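
A minimal sketch of the alternation count for one string, assuming that the black/gray marking is already available as a list of booleans (True for gray). The function name is hypothetical, and the search over pairs (n, a) performed by FindOptimal is not reproduced here.

```python
def alternation_count(mask):
    """Number of switches between black (False) and gray (True) along one string."""
    return sum(1 for prev, cur in zip(mask, mask[1:]) if prev != cur)


# Marking of the fragment accbaacbc when ba and cb are frequent.
example = [False, False, True, True, True, False, True, True, False]
print(alternation_count(example))
```

A marking that switches colour rarely corresponds to long contiguous black and gray runs, which is what one expects when fields and markup are cleanly separated.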

This accuracy is very high, but it is not perfect. In this paper, we 
adjust the output of FindOptimal according to the following rules: a black 
substring with length less than n is treated as a gray substring, a gray sequence 
of tags with length less than n among black substrings is treated as a black 
substring, and a black substring is treated as a gray substring if it is inside a 
tag and is surrounded by gray substrings. SCOOP treats the black substrings of 
the adjusted output as fields. 

SCOOP receives the adjusted output and finds all rules surrounding each 
black substring in the Record Extractor of Fig. 1. SCOOP counts substrings of 
length 2 surrounding each black substring. SCOOP increases the length of these 
substrings by one if the most frequent substring is not unique in each document. 
It continues until the most frequent substring extracted becomes unique in each 
document. 

First, SCOOP finds the “>” (the end of a start_string) which appears just 
before the first black substring in each file. SCOOP counts the substrings of 
length 2 ending with this “>”. SCOOP increases the length of these substrings by 
one if the most frequent substring of length 2 is not unique in each document. 
The most frequent substring becomes the start_string once it is unique in each 
document. 

Next, SCOOP similarly finds the end_string, which appears just after the black 
substring and starts with “<” (the start of an end_string). This start_string 
and the corresponding end_string form a candidate for the first rule. However, if 
the number of files from which no string is extracted by this rule is more than 
half of the input files, SCOOP does not use this candidate. SCOOP then proceeds 
to the black substring which appears next after the end_string, and repeats this 
until it extracts no more rules. 
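
The search for one rule can be sketched as follows. This is an illustrative reading of the description above, not the Perl implementation of SCOOP: the function names and the toy documents are hypothetical, the positions of the black substring are assumed to be given, and "unique" is taken to mean at most one occurrence in every document.

```python
from collections import Counter


def grow_delimiter(documents, positions, side, max_len=50):
    """Lengthen the candidate delimiter until the most frequent one is unique.

    positions[i] is the index in documents[i] where the field content starts
    (side="left") or ends (side="right").
    """
    for k in range(2, max_len + 1):
        if side == "left":
            pieces = [d[p - k:p] for d, p in zip(documents, positions) if p >= k]
        else:
            pieces = [d[p:p + k] for d, p in zip(documents, positions) if p + k <= len(d)]
        if not pieces:
            return None
        best = Counter(pieces).most_common(1)[0][0]
        # Assumption: "unique" means at most one occurrence in each document.
        if all(d.count(best) <= 1 for d in documents):
            return best
    return None


def extract(doc, start_string, end_string):
    """Return the text between the delimiters, or None if the rule does not match."""
    i = doc.find(start_string)
    if i < 0:
        return None
    i += len(start_string)
    j = doc.find(end_string, i)
    return doc[i:j] if j >= 0 else None


def rule_is_usable(documents, start_string, end_string):
    """Reject a rule that fails on more than half of the input files."""
    failures = sum(extract(d, start_string, end_string) is None for d in documents)
    return failures <= len(documents) / 2


# Toy documents, for illustration only.
docs = [
    "<html><h1>Head one</h1><p>Body one</p></html>",
    "<html><h1>Second</h1><p>Body two</p></html>",
]
starts = [d.index("<h1>") + 4 for d in docs]   # just after the ">" before the field
ends = [d.index("</h1>") for d in docs]        # at the "<" after the field
start = grow_delimiter(docs, starts, "left")
end = grow_delimiter(docs, ends, "right")
fields = [extract(d, start, end) for d in docs]
print(start, end, fields)
```

In the toy input the candidates of length 2 and 3 occur more than once per document, so the delimiters grow until they are unique; the resulting end_string is a proper prefix of the closing tag, which matches the observation that field boundaries are substrings of tag sequences.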

Finally, in the Formatter of Fig. 1, SCOOP extracts the strings inside each rule 
and outputs an HTML file as a list. 

3 Experiments 

We implemented SCOOP in Perl and ran it sequentially on a Compaq Alpha 
Server GS320 (731MHz Alpha21264) on several sets of inputs. An input for 
SCOOP is a set of news articles in the same language, with similar formats and 
without noise. The articles we use are written in English or Japanese. Each is 
provided as an HTML file which has a headline and a body text. 

First, we use articles obtained from “Los Angeles Times”¹ as English news 
articles. All pages linked from the URL are collected and then non-article pages 
are removed manually. The number of articles from “Los Angeles Times” is 
150 files. The total size is about 4.7M Bytes. SCOOP extracts three fields: the 
first field is the title of each page, the second field is the headline and the third 
field is the body text. The contents of the first field are equal to those of the 
second one and are surrounded by title tags. SCOOP extracts all fields from the 
150 files (100%). Fig. 2 is a part of an HTML file output by SCOOP. 

SCOOP preserves tag sequences in a field. For example, there are some 
“<br>” tags (line breaks) in articles of “Los Angeles Times”, and 
SCOOP preserves them as shown in Fig. 2. This means that SCOOP only finds 
tags designating static structures. 

Looking at an HTML file used in the experiment, we expected that the only 
fields would be the headline and the body. But SCOOP also outputs the title of 
each page as a field. In the other experiments, too, there are some cases in which 
SCOOP extracts fields that we did not expect. 

Next, we use articles obtained from “Yomiuri On-Line”² as Japanese news 
articles. The number of articles from “Yomiuri On-Line” is 65 files. The total size 
is about 1.4M Bytes. SCOOP extracts two fields: the first field is the headline 
and the second field is the body text. SCOOP extracts the first and second fields 
from all of these files (100%). 

On the other experiments, SCOOP fails to extract fields from several files. 
When SCOOP finds delimiters of a field, it begins at the boundary of black 

¹ http://www.latimes.com/ 

² http://www.yomiuri.co.jp/ 

Nepal Premier Koirala Re-Elected 
Nepal Premier Koirala Re-Elected 

POKHARA, Nepal-Nepal’s prime minister has defeated a rebellion within his ruling Nepali 
Congress party, winning re-election as party president with 64 percent of the votes. 

Prime Minister Girija Prasad Koirala secured 936 votes against 507 for his competitor, Sher 
Bahadur Deuba, in voting Monday, officials said. Deuba conceded early Tuesday. 

A dissident group led by Deuba had tried to oust Koirala for failing to reduce crime and end a 
Maoist insurgency that has killed 1,500 people in the past five years. 

The Maoist rebels, who model themselves after Peru’s Shining Path guerrillas, are demanding an 
end to Nepal’s constitutional monarchy and the feudal social structure that remains in parts of the 
Himalayan nation. 

Koirala, who came to power in March after forcing his predecessor from office, has been prime 
minister twice previously. He has held the office for most of the 10 years since democracy was 
restored to the Himalayan country. 



Japanese Team Finds 3,554 Meteorites 
Japanese Team Finds 3,554 Meteorites 

TOKYO-Japanese scientists have found 3,554 meteorites in Antarctica during a three-week 
search, a collection that could yield clues about the rest of our solar system, a government official 
said Tuesday. 

The finds were made around the Yamato mountain range about 186 miles from Japan’s base on 
the rim of Antarctica, said Shigeru Kure of Japan’s science ministry. 

A meteorite is a meteor that survives the destructive effects of a flight through the atmosphere 
and falls to the ground whole or in pieces. 

Six members of the Japanese observation team took part in the latest search conducted between 
Nov. 19 and Jan. 10, Kure said. 

"Such a large number of meteorites discovered may include some rare ones that could help in 
finding the origin of the solar system, or the possibility of any traces of life on other planets," 
Kure said. 

In 1998, a total of 4,180 fallen meteors were discovered by the Japanese team in Antarctica -the 
largest number found in a single search, Kure said. 

To date, Japanese observation teams have found about 13,000 meteorites in Antarctica, about half 
of all found there. 

On the Net: 

Japan’s Ministry of Education, Culture, Science and Technology: 
http://wwwwp.monbu.go.jp/index -e.html 



Fig. 2. A part of the output of SCOOP given Los Angeles Times articles. This is a 
part of a PS file generated by “Netscape” 



substrings in an output of FindOptimal. If many contents end with 
“</tagA></tagB>” and other contents end with “</tagB>”, SCOOP cannot extract 
the contents that end with “</tagB>” because SCOOP decides that the end_string is 
“</tagA></tagB>”. Therefore, the accuracy of extraction is lower for some 
contents. However, we think that if we use some knowledge of HTML, SCOOP can 
extract such contents at a high rate in this case. 

When FindOptimal is given files generated by a search facility as input, 
it extracts contents at a low rate. Some substrings in fields are frequent 
in these files; for example, query terms appear frequently in search results. 
So FindOptimal cannot recognize such substrings as the contents of fields. 
In such a case, SCOOP cannot extract records. SCOOP thus depends on the 
accuracy of FindOptimal. 



4 Conclusion 



We implemented the SCOOP system, which extracts records from semi-structured 
documents with similar formats and outputs them as a list. We experimented 
with news articles written in English or Japanese. SCOOP extracted records at a 
high rate although it does not use knowledge about the input. Moreover, SCOOP 
extracted a field which we had not expected. 

SCOOP assumes that frequent substrings express structures of documents 
and are not the contents of fields. And SCOOP extracts the most frequent sub- 
strings around black substrings as delimiters. 

If the semi-structured documents given to SCOOP include noise, SCOOP 
extracts incorrect rules and cannot extract records. Thus, guaranteeing noise 
tolerance for SCOOP is important future work. Moreover, if the semi-structured 
documents given to SCOOP have several instances of the same field in each 
document, SCOOP extracts them as instances of different fields and cannot 
extract them as instances of the same field. Handling this case is also important 
future work for SCOOP. 



Abstract. A meta-analysis of the challenging data set on the mutagenicity of 
nitroaromatic compounds has been performed. There are two ways of structure 
coding: standard topological indexes or so-called fingerprint descriptors. In our 
previous work, a unique structure coding by fingerprint descriptors was used for 
the discovery of mutagenes with the GUHA+/- software system. GUHA can 
process nominal variables, which are transformed to binary strings in the course 
of computation. Any structure coding can then be used for GUHA. The data 
encoded by topological indexes were processed by the GUHA+/- software system 
as well. The hypotheses on the reasons for mutagenicity of nitroaromatic 
compounds were generated by GUHA+/- for Windows. Processing of data 
encoded by topological indexes was rather demanding because of the large 
number of structure descriptors. Meta-analysis combining fingerprint 
descriptors for a posteriori structure templates resulting from previous analyses 
with more flexible topological indexes seems to be more appropriate. 



1 Meta-analysis 

The aim of meta-analysis is to relate the performance of different machine-learning 
algorithms to the characteristics of the data set [1]. The famous mutagenicity data 
set [2] represents a discovery challenge tackled by many researchers. Muggleton et 
al. [3] used the Inductive Logic Programming (ILP) system Progol for mutagene 
discovery with a subset of the data from [2]. This subset was already known not to 
be amenable to statistical regression, though its complement was adequately 
explained by the linear model [2]. In [3], topological descriptors were used. The 
advantage of Muggleton’s approach is its flexibility. Inokuchi et al. [4], [5] used the 
principle of graph abduction for this data set. Inokuchi’s approach is similar to our 
fingerprint descriptor coding, but it is more flexible, mining frequent graph 
substructures similar to our a priori fingerprint descriptors. On the other hand, the 
computation based on fingerprint descriptors is faster by an order of magnitude. 
We propose connecting both methods by generating fingerprint descriptors 
through graph abduction. Okada obtained important results using the Cascade 
Model [6]. The results of Okada’s approach are in accordance with our results 
obtained with both the fingerprint and the topological descriptors mentioned 
below. Matsuda et al. [7] apply the Graph Based Induction learning technique to 
extract typical patterns from graph data. 



2 Principles of GUHA Method 

The basic ideas of the GUHA (General Unary Hypotheses Automaton) method were 
presented in [8] as early as 1966. The starting notion of the method is an object. 
Each object has properties expressed by variables ascribed to this object. For 
example, an object can be a man with properties given by the variables sex, age, 
color of eyes, etc. In order to make reasonable knowledge discovery, we need a set 
of objects of the same kind which differ in the values of the variables defined on 
them. 

The aim of the GUHA method is to generate hypotheses on relations among the 
properties of the objects which are in some respect interesting. This generation is 
processed systematically; the machine generates, in some sense, all possible 
hypotheses and collects the interesting ones. A hypothesis is generally composed of 
two parts: the antecedent and the succedent. The antecedent and the succedent 
are tied together by a generalized quantifier, which describes the relation between 
them. The antecedents and succedents are propositions on the object in the sense of 
classical propositional logic, so they are true or false for a particular object. These 
propositions can be simple or compound, as in propositional logic. Compound 
propositions are usually composed of literals joined by the conjunction connective. 
The formulation of these propositions is enabled by the original categorization of 
variables. Given an antecedent and a succedent, the frequencies of the four possible 
combinations can be computed and expressed in compressed form as the so-called 
four-fold table (ff-table). A general ff-table looks like this: 



ff-table            Succedent    Non(succedent) 
Antecedent          a            b 
Non(antecedent)     c            d 



Where “a” is the number of the objects satisfying both the antecedent and succedent, 
“b” is the number of the objects satisfying the antecedent but not the succedent, etc. 

A generalized quantifier is a decision procedure assigning 1 or 0 to each ff-table. If 
the value is 1, then we accept the hypothesis with this ff-table; if it is 0, then we do 
not accept it. The basic Fisher generalized quantifier defined and used in GUHA is 
given by the Fisher exact test known from mathematical statistics. For each 
hypothesis, the value of the Fisher statistic given by the values a, b, c, and d of the 
ff-table is computed. Its value, simply put, describes the measure of association 
between the antecedent and the succedent. The lower the value of the Fisher 
quantifier, the stronger the association. 
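
A minimal sketch of how an ff-table and a Fisher-type quantifier could be computed. It is an illustration under stated assumptions, not the GUHA+/- implementation: the function names are hypothetical, the statistic is taken to be the one-sided p-value of the Fisher exact test, and both the significance level 0.05 and the additional condition a·d > b·c are assumptions.

```python
from math import comb


def four_fold_table(antecedent, succedent):
    """Frequencies (a, b, c, d) from two equally long lists of truth values."""
    pairs = list(zip(antecedent, succedent))
    a = sum(1 for x, y in pairs if x and y)
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if not x and y)
    d = sum(1 for x, y in pairs if not x and not y)
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """Probability of at least `a` joint successes with all margins fixed."""
    n, row, col = a + b + c + d, a + b, a + c
    upper = min(row, col)
    favourable = sum(comb(row, k) * comb(n - row, col - k) for k in range(a, upper + 1))
    return favourable / comb(n, col)


def fisher_quantifier(table, alpha=0.05):
    """1 if the hypothesis is accepted, 0 otherwise (alpha is an assumed level)."""
    a, b, c, d = table
    return 1 if a * d > b * c and fisher_one_sided(a, b, c, d) <= alpha else 0


table = four_fold_table(
    [True, True, True, True, False, False, False, False],
    [True, True, True, False, True, False, False, False],
)
print(table, fisher_one_sided(*table), fisher_quantifier(table))
```

With only eight objects the association in this toy table is not strong enough to be accepted, which illustrates why a lower value of the statistic indicates a better hypothesis.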

In [9], a measure of the information content of rules obtained by the mining 
procedure is proposed, which suggests a promising improvement of the procedure. 


3 Data Preprocessing 

The mutagenicity data set was given in two tables. Both tables can describe 
compounds in the same manner; therefore, there can be redundancy in the data. This 
redundancy is undesirable in the search for quantitative structure-activity 
relationships (QSARs), but the method used (GUHA) enables the choice of the best of 
the redundant variables for a dependency relation. 

All descriptors seem to be cardinal, so their preprocessing is necessary. We 
divided each variable into several intervals. Among the variables there is a large 
number of features and indexes which are mostly unknown to us, so we divided them 
automatically into three equifrequent intervals (Low, Medium, High). That means 
that one interval, i.e. one variable category, contains about 75 cases. We omitted the 
hypotheses with medium activity from the interval (-0.1, 1.9) in the succedent. 
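
An equifrequent division into three categories can be sketched as follows. This is a minimal illustration with a hypothetical function name and made-up values; it assigns categories by rank, so equal values may end up in different categories, which a production implementation would have to handle.

```python
def equifrequent_categories(values, labels=("Low", "Medium", "High")):
    """Assign each value a category so that the categories are equally populated."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    categories = [None] * len(values)
    for rank, index in enumerate(order):
        # Rank-based assignment: the lowest third is Low, the next Medium, the rest High.
        categories[index] = labels[rank * len(labels) // len(values)]
    return categories


# Made-up descriptor values, for illustration only.
descriptor = [5.0, 1.0, 3.0, 2.0, 6.0, 4.0]
print(equifrequent_categories(descriptor))
```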

Some variables take only one value (0); they cannot be useful and were therefore 
omitted (a_nP, Fcharge, ...). Furthermore, the data could be used directly as the 
input of GUHA+/-. 

Meta-data were input as additional fingerprint descriptors. 



4 The Results of Data Mining 

GUHA is used for generating hypotheses of the following type: 

"if the car is black and is cheaper than 50000 crowns, then the owner is a widower 
older than 50." 

Most of the variables were nominal or dividable into natural intervals. Now, the task 
is not only to find hypotheses of the type: 

"LUMO from x to y causes Activity from xx to yy." 

Such results can be substantially dependent on the interval division of variables. 
Therefore, we should try to find the variable (combination of variables) affecting 
mutagenicity. 

A correlation matrix of all variables was computed and hypotheses consisting of 
redundant variables were omitted (only the best of them were chosen). 

Our efforts were divided into four phases. The Fisher quantifier was always used as 
the lead criterion in the search for hypotheses. The second important criterion was 
Prob (the number of cases fulfilling the hypothesis divided by the number of cases 
fulfilling the antecedent), which characterizes hypotheses in terms of an implication. 
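
In terms of the ff-table, Prob can be sketched as a/(a + b), assuming that "fulfilling the hypothesis" means satisfying both the antecedent and the succedent. The function name and the two ff-tables below are hypothetical and do not come from the mutagenicity data.

```python
def prob(a, b):
    """Share of the objects satisfying the antecedent that also satisfy the succedent."""
    if a + b == 0:
        raise ValueError("no object satisfies the antecedent")
    return a / (a + b)


# Made-up ff-tables (a, b, c, d), for illustration only.
tables = {
    "hypothesis A": (30, 10, 20, 40),
    "hypothesis B": (12, 3, 38, 47),
}
ranked = sorted(tables, key=lambda name: prob(*tables[name][:2]), reverse=True)
print([(name, prob(*tables[name][:2])) for name in ranked])
```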



Most of the hypotheses refer to Activity in the “High” value interval. For example, 
the best hypotheses can be interpreted in the following manner: 

1. Presence of the XVIII structure fragment [6] increases the probability of Activity 
in the “High” interval. 

2. 6-5-6 condensed rings and LUMO “High” (high reduction potential) increase the 
probability of Activity in the “High” interval (in agreement with [6] and [7]). 

3. More than one NO2 group in a minimum tricyclic compound (f = 1 [2]) increases 
the probability of Activity in the “High” interval. 

4. Presence of the XVIII structure fragment [6] and the absence of mutagenes from [3] 
increase the probability of Activity in the “High” interval. 



The most interesting hypothesis is undoubtedly the following. It has excellent 
values of both characteristics (Fisher and Prob). We could say that it is the best 
hypothesis we have generated. 

5. “High” balabanj index (based on molecular graph distance index) and Polarity 
“Low” increase the probability of Activity in the “Low” interval. 



5 Conclusion 

Chemical interpretation of the most favored hypothesis is the following: 

A high Balaban connectivity topological index (based on the molecular graph distance 
index) and low polarity imply low mutagenicity. 

Apart from this hypothesis representing new toxicological knowledge, several 
hypotheses on the reasons for mutagenicity (mutagenes) were generated using the 
GUHA method. Some of them represent new toxicological knowledge. Other 
hypotheses are in accordance with toxicological evidence [10]. 

We presented a number of hypotheses discovered by GUHA+/-. The next step should 
be studying these hypotheses and generating more precise hypotheses including three 
or more variables in the antecedent, in accordance with knowledge of the variables. 

Our assumption that GUHA can be used in the search for interdependencies seems to 
be right. We tried to draw dependency graphs of the best hypotheses and they showed 
the trends. 

According to the theory of global interpretation of multiple hypotheses testing the 
global significance of our results was considered. From this point of view the results 
as a whole can be interpreted as sufficiently reliable knowledge on the universe of 
which the data form a random sample. 